Enhancing Security for Resource-Constrained Smart Cities IoT Applications: Optimizing Cryptographic Techniques with Effective Field Multipliers

Ibrahim, Atef; Gebali, Fayez

doi:10.3390/cryptography9020037

by Atef Ibrahim ¹,* and Fayez Gebali ²

¹ Department of Computer Engineering, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 16278, Saudi Arabia

² Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC V8P 5C2, Canada

* Author to whom correspondence should be addressed.

Cryptography 2025, 9(2), 37; https://doi.org/10.3390/cryptography9020037

Submission received: 2 April 2025 / Revised: 17 May 2025 / Accepted: 26 May 2025 / Published: 1 June 2025

(This article belongs to the Special Issue Cryptography and Network Security—CANS 2024)

Abstract

The broadening adoption of interconnected systems within smart city environments is fundamental for the progression of digitally driven economies, enabling the refinement of city administration, the enhancement of public service delivery, and the fostering of ecologically sustainable progress, thereby aligning with global sustainability benchmarks. However, the pervasive distribution of Internet of things (IoT) apparatuses introduces substantial security risks, attributable to the confidential nature of processed data and the heightened susceptibility to cybernetic intrusions targeting essential infrastructure. Commonly, these devices exhibit deficiencies stemming from restricted computational capabilities and the absence of uniform security standards. The resolution of these security challenges is paramount for the full realization of the advantages afforded by IoT without compromising system integrity. Cryptographic protocols represent the most viable solutions for the mitigation of these security vulnerabilities. However, the limitations inherent in IoT edge nodes complicate the deployment of robust cryptographic algorithms, which are fundamentally reliant on finite-field multiplication operations. Consequently, the streamlined execution of this operation is pivotal, as it will facilitate the effective deployment of encryption algorithms on these resource-limited devices. Therefore, the presented research concentrates on the formulation of a spatially and energetically efficient hardware implementation for the finite-field multiplication operation. The proposed arithmetic unit demonstrates significant improvements in hardware efficiency and energy consumption compared to state-of-the-art designs, while its systolic architecture provides inherent timing-attack resistance through deterministic operation. 
The regular structure not only enables these performance advantages but also facilitates future integration of error-detection and masking techniques for comprehensive side-channel protection. This combination of efficiency and security makes the multiplier particularly suitable for integration within encryption processors in resource-constrained IoT edge nodes, where it can enable secure data communication in smart city applications without compromising operational effectiveness or urban development goals.

Keywords:

smart cities; sustainable development goals (SDGs); edge computing; IoT security issues; protected data transmission; cryptography; algebraic field operations; concurrent processing; limited-resource devices

1. Introduction

The expanding network of interconnected devices, commonly referred to as the Internet of things (IoT), is poised to transform various aspects of daily life and industrial operations. This technological advancement is expected to influence diverse sectors, including home automation, urban development, healthcare, and transportation. Smart cities, in particular, play a pivotal role in fostering digital economies by leveraging IoT technologies to enhance urban management, improve citizen services, and promote sustainable development [1,2]. For instance, smart traffic systems utilize real-time data from sensors to optimize traffic flow, reduce congestion, and enhance safety for both drivers and pedestrians. Similarly, intelligent waste management systems can dynamically adjust collection routes based on the fill levels of waste bins, thereby minimizing fuel consumption and operational expenses. These advancements align closely with global sustainability goals, particularly Sustainable Development Goal 9 (Industry, Innovation, and Infrastructure) and Sustainable Development Goal 11 (Sustainable Cities and Communities).

However, the widespread deployment of smart city IoT devices introduces significant security challenges. These devices frequently process sensitive data, including personal health information, financial records, and location data. Their inherent vulnerabilities make them attractive targets for cyberattacks, which can compromise larger networks or disrupt critical infrastructure, such as power grids or emergency response systems. For example, a breach in a smart grid could result in extensive power outages, posing risks to public safety and economic stability. The unique constraints of IoT devices—such as limited processing power, restricted memory capacity, and the absence of standardized security protocols—further complicate the implementation of robust security measures [3]. Many IoT devices lack the computational resources to support advanced security software, rendering them susceptible to exploitation. Consequently, it is imperative to design IoT systems with integrated security features, including robust authentication mechanisms, data encryption, regular software updates to address vulnerabilities, and continuous monitoring to detect potential intrusions.

Given the resource-constrained nature of many IoT devices, implementing comprehensive security solutions remains a significant challenge. To address these issues, numerous initiatives are being pursued by governments, industry stakeholders, and academic institutions. These efforts aim to establish standardized security frameworks and best practices tailored to the unique requirements of IoT ecosystems. By embedding security considerations into the design and deployment of IoT technologies, it is possible to safeguard the infrastructure underpinning smart cities while advancing global sustainability goals. This proactive approach not only mitigates security risks but also bolsters public trust in the adoption of smart technologies, ensuring that the benefits of IoT can be realized without compromising safety or privacy.

In the context of resource-constrained IoT edge devices, various cryptographic methodologies are employed, with elliptic curve cryptography (ECC) standing out as a particularly effective solution. ECC is specifically designed for environments with limited computational resources, making it highly suitable for IoT applications where processing power and battery life are critical constraints. One of the key advantages of ECC is its ability to provide a high level of security using relatively small key sizes, which is especially beneficial for devices with restricted memory and processing capabilities.

At the core of optimized ECC cryptographic algorithms lies the implementation of finite-field arithmetic operations, among which finite-field multiplication plays a pivotal role [4]. This operation is fundamental to the efficient execution of the cryptographic algorithm, as it directly influences the speed and resource efficiency of various cryptographic tasks. Furthermore, other critical field operations, such as inversion, division, and exponentiation, are inherently dependent on finite-field multiplication as their foundational building block. This interdependence highlights the importance of developing efficient multiplication techniques to enhance the overall performance of cryptographic computations. As a result, significant research efforts have been dedicated to designing compact and highly efficient multiplier architectures tailored for cryptographic algorithms in IoT devices [5,6,7,8,9,10,11,12,13,14,15,16,17]. These architectures aim to minimize power consumption while maximizing computational speed, enabling real-time processing of cryptographic operations. Innovations such as parallel processing techniques, hardware-based multiplier implementations, and algorithmic optimizations have been explored to achieve these objectives. These advancements not only enhance the security of IoT applications but also enable the practical deployment of robust cryptographic mechanisms in resource-constrained environments. By addressing these challenges, such innovations contribute to the reliability and safety of smart city IoT ecosystems, ensuring that cryptographic operations can be performed efficiently even in devices with limited resources.

1.1. Literature Review

The basis chosen to represent elements of GF($2^{m}$) significantly affects the efficiency of multiplication in finite fields. Representations such as the polynomial basis (PB), normal basis (NB), dual basis (DB), and redundant basis (RB) each provide distinct benefits for particular usage scenarios [18,19,20]. The normal basis exhibits pronounced advantages within cryptographic applications. A key strength of NB is that squaring reduces to a simple cyclic shift of the coefficient vector, which requires little computational effort and time. Given the frequent use of squaring within cryptographic protocols, such as those involved in ECC and digital signatures, this feature renders NB particularly suitable for high-speed computation.
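The squaring property can be checked on a small field. The following is a minimal sketch, assuming GF($2^{3}$) generated by $x^{3} + x^{2} + 1$ and the normal basis $\{\beta, \beta^{2}, \beta^{4}\}$ with $\beta = x$; the field polynomial, the choice of $\beta$, and all function names are illustrative assumptions and are not taken from the cited works.

```python
# Sketch: in a normal basis, squaring is a cyclic shift of the coefficients.
# Assumed setting: GF(2^3) with field polynomial x^3 + x^2 + 1 (bitmask 0b1101).
M = 3
MOD = 0b1101


def gf_mul(a, b):
    """Multiply two field elements given as polynomial-basis bitmasks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):  # degree reached M, so reduce by the field polynomial
            a ^= MOD
    return result


# Normal basis {beta, beta^2, beta^4} built by repeated squaring of beta = x.
BETA = 0b010
NORMAL_BASIS = [BETA]
for _ in range(M - 1):
    NORMAL_BASIS.append(gf_mul(NORMAL_BASIS[-1], NORMAL_BASIS[-1]))


def from_normal(coeffs):
    """Convert normal-basis coefficients [a0, a1, a2] to a polynomial-basis bitmask."""
    value = 0
    for c, basis_element in zip(coeffs, NORMAL_BASIS):
        if c:
            value ^= basis_element
    return value


def nb_square(coeffs):
    """Square in the normal basis: rotate the coefficient vector by one position."""
    return [coeffs[-1]] + coeffs[:-1]


if __name__ == "__main__":
    for n in range(1 << M):
        coeffs = [(n >> k) & 1 for k in range(M)]
        v = from_normal(coeffs)
        print(coeffs, "->", nb_square(coeffs), gf_mul(v, v) == from_normal(nb_square(coeffs)))
```

The rotation works because squaring is linear over GF(2) and maps each basis element $\beta^{2^{k}}$ to the next one, with $\beta^{2^{m}} = \beta$ wrapping around.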

Beyond its effectiveness in squaring operations, the normal basis (NB) offers a stable and uniform representation of elements. This uniformity guarantees a consistent framework for arithmetic computations, resulting in reliable performance that facilitates hardware implementation. Such reliability is vital for constructing streamlined digital circuits, where predictable timing and efficient resource allocation are essential. Furthermore, the regularity inherent in NB promotes smoother data flow and reduces the complexity of control logic, which is particularly important in environments with limited resources. In contrast, other representations, such as the polynomial basis (PB), frequently display fluctuations in operational complexity, which can complicate hardware design and optimization efforts [21,22,23,24,25]. The inconsistencies associated with PB may lead to increased latency and higher power consumption, posing significant challenges for IoT devices where energy efficiency is crucial. Thus, the selection of basis representation not only impacts computational performance but also plays a critical role in shaping the architecture and effectiveness of cryptographic systems tailored for embedded applications.

Normal bases, while advantageous in many respects, exhibit specific limitations, particularly regarding the complexity of element multiplication. This process is often more intricate and demanding compared to other representations, such as polynomial bases or binary bases, which can significantly hinder computational efficiency, especially in applications reliant on frequent and rapid multiplication operations. To address these challenges, researchers have developed specialized variants of normal bases, including the optimal normal basis (ONB) and the Gaussian normal basis (GNB). The ONB minimizes the number of multiplications required for operations, enhancing speed without compromising security, while the GNB leverages specific mathematical properties to simplify multiplication and improve performance. These innovative variants aim to streamline the multiplication process while preserving the beneficial attributes of normal bases, thereby contributing to enhanced overall performance in various computational contexts, particularly in resource-constrained environments where efficiency is critical.

Given the challenges inherent to normal bases, the Dickson basis stands out as a strong alternative for developing efficient finite-field multipliers. Its spatial requirements are often comparable to those of optimal normal basis (ONB) multipliers, making it especially advantageous in resource-constrained environments, such as embedded systems and mobile devices, where minimizing hardware usage is essential for optimal performance. Researchers, including Hasan and Negre [18], have delved into the use of streamlined Dickson polynomials—like binomials and trinomials—to enhance the performance and efficiency of these multipliers. This research has led to significant reductions in complexity, latency, and power consumption, addressing critical concerns in modern computing. Additionally, Chiou et al. [19] have made notable progress by designing high-performance bidirectional systolic array multipliers that leverage the advantages of the Dickson basis. This work emphasizes its practical relevance in cryptographic applications, where fast and efficient parallel processing is crucial for handling large data sets. Furthermore, the Dickson basis simplifies the implementation of certain arithmetic operations, thereby improving overall system performance. These advancements collectively highlight the Dickson basis’s capacity to effectively address the limitations of traditional methods while ensuring sustained computational efficiency, positioning it as a promising option for future innovations in finite-field arithmetic.

The bidirectional architecture adopted in several earlier designs, including that of Chiou et al. [19], makes it difficult to integrate effective error-detection systems. These systems are essential for mitigating the risks associated with side-channel attacks, which exploit vulnerabilities often present in elliptic curve cryptography implementations. This difficulty underscores the need for resilient architectures that can incorporate security features without compromising performance or efficiency. In contrast, Chiou et al. [20] proposed a unidirectional systolic array for the Dickson basis multiplier that readily accommodates reliable error-detection mechanisms. However, the spatial complexity of this design scales quadratically, which renders it unsuitable for compact, ultra-low-power devices such as RFID tags, where minimal resource utilization is paramount.

The choice of irreducible polynomial is fundamental to the functionality, efficiency, and performance of finite-field multipliers, because this polynomial underpins every arithmetic computation in the field and directly affects both speed and resource utilization [21,22,23,24,25]. Irreducible trinomials and pentanomials are commonly preferred for their well-established computational efficiency, whereas multipliers based on general polynomials remain versatile and applicable across a diverse array of use cases, including those with specific performance requirements and operational constraints. All-one irreducible polynomials, although less frequently employed in practice, offer distinct advantages in specific contexts and contribute to optimized multiplier designs for specialized requirements [26,27,28]. This diversity in polynomial selection broadens the adaptability of finite-field multiplication and keeps it relevant across a wide range of application scenarios.

Design methodologies exert a substantial influence on the resultant multiplier architectures, each demonstrating distinct operational characteristics and trade-offs that can significantly affect overall performance. For example, bit-serial multipliers are widely acknowledged for their reduced area and lower energy consumption, albeit requiring multiple clock cycles to execute a singular multiplication operation, which can be a limiting factor in certain applications [5,29,30]. Conversely, bit-parallel multipliers provide computational outcomes within a solitary clock cycle, but they typically necessitate greater hardware resources and elevated power usage, thus proving less suitable for deployment in resource-constrained environments where efficiency is paramount [7,8,11,31,32,33,34,35,36]. Within the domain of very large scale integration (VLSI) design, systolic and semi-systolic array architectures have garnered increased prominence due to their modular structure, scalability, and support for parallel processing, which enables them to handle complex computations effectively. These architectural frameworks are particularly well suited for high-throughput applications, facilitating efficient hardware resource utilization while preserving design flexibility, thus making them attractive options for modern engineering challenges.
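The bit-serial trade-off described above can be illustrated with a short software model. This is a minimal sketch of an LSB-first shift-and-add multiplier in the polynomial basis, not the Dickson basis design proposed in this paper; the function name and the cycle counter are illustrative assumptions. Each loop iteration stands for one clock cycle, so one multiplication takes $m$ cycles.

```python
def bit_serial_mul(a, b, mod, m):
    """LSB-first bit-serial multiplication in GF(2^m), polynomial basis.

    a, b : operands as bitmasks (bit k is the coefficient of x^k)
    mod  : field polynomial as a bitmask, including the x^m term
    m    : field degree
    Returns (product, cycles); one operand bit is consumed per cycle.
    """
    acc = 0
    cycles = 0
    for k in range(m):
        if (b >> k) & 1:      # current bit of b selects whether a is accumulated
            acc ^= a
        a <<= 1               # a <- a * x
        if (a >> m) & 1:      # reduce when the degree reaches m
            a ^= mod
        cycles += 1
    return acc, cycles


if __name__ == "__main__":
    # Known example in the AES field GF(2^8), x^8 + x^4 + x^3 + x + 1:
    # {57} * {83} = {c1}
    product, cycles = bit_serial_mul(0x57, 0x83, 0x11B, 8)
    print(hex(product), cycles)
```

A bit-parallel design would, in effect, unroll all $m$ iterations into combinational logic, which is where the additional area and power come from.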

Considerable scholarly investigation has been dedicated to the refinement of systolic and semi-systolic multiplier designs specifically for binary extension fields GF($2^{m}$). Notable contributions include the development of fault-tolerant semi-systolic array multipliers by Lee and Chiou [5,37], which significantly enhance reliability in critical computations, particularly in systems where data integrity is paramount, as they enable the detection and correction of errors during arithmetic operations. Additionally, the endeavors of Huang et al. [6] to optimize both temporal and spatial resource utilization have led to improved overall system performance, allowing for more efficient use of hardware resources while minimizing latency in processing. Furthermore, Choi and Lee [8] introduced innovative systolic array architectures that facilitate concurrent multiplication and squaring operations, resulting in a substantial augmentation of modular exponentiation performance while minimizing hardware resource expenditure, thus proving advantageous in high-performance computing environments where speed is essential. The implementation of least significant bit (LSB)-first multiplication methodologies further enhances operational efficacy, rendering these designs particularly effective for real-world implementations, especially in applications requiring rapid data processing and reduced energy consumption. Collectively, these advancements underscore the ongoing evolution and significance of multiplier designs in modern computational contexts, highlighting their critical role in enhancing the efficiency and reliability of complex digital systems.

Recent scholarly investigations have increasingly focused on enhancing the speed and efficiency of multiplier architectures, particularly in contexts that necessitate rapid execution times and robust performance across diverse operational conditions. Chiou et al. [19] introduced a semi-systolic array multiplier specifically engineered to significantly reduce time complexity, thereby facilitating high-throughput multiplication in critical systems where expeditious processing is essential. Building upon this foundational work, Lee [38,39] developed semi-systolic Montgomery modular multipliers employing a dual-tier systolic computation approach, which not only enhances spatial resource efficiency but also effectively diminishes latency—key considerations for VLSI implementations that require optimized designs for practical applications. This innovative two-level strategy fosters effective concurrent processing and pipelining, thereby optimizing modular multiplication performance across various platforms and use cases.

Additionally, Mathe and Boppana [40] put forward a flexible multiplier architecture designed to handle both parallel and sequential inputs, thereby boosting performance across a variety of operand types and operational contexts. This adaptability is crucial for modern applications that demand high efficiency under varying computational loads. Ibrahim [15] contributed significantly by developing one-dimensional bit-serial and bit-parallel systolic array frameworks specifically tailored for computations over GF($2^{m}$). This innovation not only enhanced resource utilization but also proved especially advantageous for applications like error-correcting codes and secure communication systems, where precision and reliability are paramount.

Pillutla and Boppana [23] made further advancements in the field by creating a polynomial basis systolic multiplier customized for particular field dimensions. This reflects a growing trend toward designs that are specifically tailored to meet unique application needs, ensuring higher performance in targeted scenarios. Furthermore, Lee’s creative approach using a Toeplitz matrix-vector representation effectively reduced the complexity associated with Montgomery-based bit-parallel multipliers. This simplification leads to designs that significantly improve both efficiency and performance, making them suitable for high-speed cryptographic applications [34].

Sarmadi’s innovative two-dimensional parallel systolic multiplier [35], based on the principles of the Montgomery algorithm, achieved remarkable operational speeds while effectively reducing spatial requirements. This design enables the concurrent processing of multiple operations, greatly enhancing overall throughput and making it suitable for high-demand applications. In a further enhancement of this concept, Mathe [36] introduced interleaved multiplication techniques within the framework of a two-dimensional parallel systolic multiplier. This approach not only maintains high-speed performance but also optimizes resource utilization, ensuring that hardware can operate efficiently even under heavy workloads. Together, these advancements reflect a dedicated effort to develop flexible multiplier architectures capable of meeting the rising demands of modern cryptographic applications. By focusing on improving both speed and resource efficiency, these innovations play a vital role in the evolution of computational methodologies in this essential field, ultimately leading to more robust and effective hardware solutions for the future.

1.2. Paper Contribution

The present study builds directly on the authors’ previous research referenced in [16,17]. Its novel contributions enhance the earlier work through a unique systolic array design, improved processing element architectures, and optimized signal routing schemes, collectively achieving superior performance compared to the prior designs. This research makes the following significant advancements in the development of efficient cryptographic multipliers tailored for resource-constrained IoT applications in smart cities.

It introduces an innovative computational systolic architecture specifically designed for a Dickson basis multiplier, aimed at achieving a more compact physical footprint and reduced energy consumption compared to existing architectural solutions. In contrast to prior scholarly contributions, which often employed ad hoc design methodologies and neglected the optimization of critical performance metrics such as latency, throughput, and energy dissipation, the proposed approach emphasizes a systematic framework for the intentional distribution of temporal and spatial data. This strategic allocation enables the development of an architecture tailored to meet specific application requirements effectively. Furthermore, a formalized analytical framework underpins a comprehensive evaluation of the multiplier’s configuration, facilitating the identification of operational enhancements. Through this rigorous analysis, the proposed design not only addresses the limitations of previous approaches but also provides a pathway for optimizing performance in diverse computational environments. This advancement signifies a meaningful contribution to the field, offering a refined methodology that aligns with contemporary demands for efficiency and effectiveness in multiplier architecture.
The structural design of the multiplier, as delineated in this study, reveals a substantial reduction in hardware complexity compared to conventional two-dimensional architectures, exhibiting linear rather than quadratic growth in resource requirements. This streamlined architecture yields significant improvements in both physical footprint and energy consumption while maintaining competitive processing speeds comparable to traditional designs. The systolic implementation provides inherent timing-attack resistance through deterministic operation and enables future integration of error-detection mechanisms, combining efficiency with security advantages. Its modular configuration with direct PE-to-PE communication minimizes signal delays and optimizes data exchange, making it particularly suitable for VLSI implementations in resource-constrained IoT edge nodes. These combined attributes—reduced hardware complexity, maintained throughput, inherent security features, and VLSI-friendly design—position the multiplier as an optimal solution for secure encryption processors in smart city applications, where operational efficiency and resource optimization are paramount.
The substantial hardware and energy efficiencies achieved through the proposed multiplier architecture position it as a highly effective solution for compact IoT edge devices within smart cities, which often function under significant resource limitations. These devices are typically required to operate within stringent power constraints while maintaining high performance levels for applications such as intelligent traffic management, environmental monitoring, and surveillance systems. By adeptly addressing the challenges associated with resource-constrained environments, this architecture facilitates the implementation of sophisticated features while ensuring optimal performance and energy efficiency.
Furthermore, the efficacy of the proposed multiplier is crucial for enabling the implementation of cryptographic algorithms on resource-constrained IoT devices, thereby supporting secure identification and access control mechanisms. As these systems increasingly depend on cryptographic algorithms to protect sensitive data, the reduction in energy consumption and hardware complexity facilitates the deployment of robust security measures within IoT devices, ensuring the confidentiality and integrity of user information. Given the rising demand for inclusive and accessible technology, the development of such secure solutions is essential for creating environments that foster independence and accessibility while addressing the diverse needs of urban populations. By facilitating the seamless integration of efficient cryptographic solutions, the proposed design significantly enhances the resilience and security of smart city infrastructures, ultimately contributing to a safer and more reliable urban ecosystem.
Within the context of smart city deployments, where a multitude of networked devices must function efficiently within stringent spatial and power constraints, the aforementioned design presents substantial benefits. By reducing the requisite silicon area, the proposed multiplier facilitates the creation of more compact integrated circuits, enabling seamless integration into diverse sensor and device deployments across urban landscapes. Furthermore, its diminished power consumption ensures prolonged operational lifespans for these devices, mitigating the need for frequent recharging or maintenance, a critical factor for the sustainability of smart city initiatives. Consequently, the proposed architecture not only enhances the computational performance of cryptographic functions but also contributes to the overall operational efficiency and extended lifespan of IoT systems, thereby establishing it as a vital element for the development of future smart urban infrastructure.

1.3. Paper Organization

The paper is structured to provide a thorough exploration of the topic. Section 2 offers a concise summary of the Dickson basis multiplication technique, laying the groundwork for the development of the new systolic structure. Section 3 provides a high-level overview of the systolic design flow. Section 4 analyzes the dependency graphs (DGs) related to the Dickson-based multiplier, investigating the complex interrelationships among operations and how these affect overall efficiency. This analysis reveals significant insights into data movement and critical performance pathways. In Section 5, we discuss the architecture and design of the proposed systolic Dickson basis multiplier, focusing on its unique features and expected improvements. Section 6 offers a comparative analysis of performance metrics for various multipliers, including those using the Dickson basis, assessing their effectiveness in IoT applications within resource-constrained smart city environments. Section 7 analyzes the proposed architecture’s security, highlighting protections against side-channel attacks and vulnerabilities to be addressed with future error-detection and masking techniques. Finally, Section 8 summarizes the key findings and suggests potential directions for future research, aiming to inspire further advancements in finite-field arithmetic and multiplier technologies.

2. Multiplication Using the Dickson Basis in GF($2^{m}$)

The multiplication procedure using the Dickson basis in GF($2^{m}$) is outlined in various sources [16,17,18,19,20]. In this section, we provide a brief overview of this multiplication method to serve as a foundation for developing the new systolic structure.

Suppose the Dickson basis is $\Phi = \{\phi_{1}, \phi_{2}, \dots, \phi_{q}\}$, and the irreducible binomial $H = \phi_{q} + 1$ is employed to generate field elements. Within the finite field GF($2^{q}$), elements $E, M, N$ are represented using this basis as $E = e_{1} \phi_{1} + e_{2} \phi_{2} + \dots + e_{q} \phi_{q}$, $M = m_{1} \phi_{1} + m_{2} \phi_{2} + \dots + m_{q} \phi_{q}$, and $N = n_{1} \phi_{1} + n_{2} \phi_{2} + \dots + n_{q} \phi_{q}$, where $e_{i}, m_{i}, n_{i}$ are members of GF(2) for $1 \leq i \leq q$. The element $N$ is defined as the remainder upon dividing the product of $E$ and $M$ by $H$, which can be written as $N = E M \bmod H$. This relationship is vital for several applications in the field of resource-constrained cryptography.

Based on the observations detailed in [16,17,18,20], when the irreducible binomial $H = \phi_{q} + 1$ is employed in the element generation process, the property $\phi_{q + i} = \phi_{i} + \phi_{q - i}$ is valid for all non-negative integers $i$. Therefore, the product $N$ can be determined by applying this property, resulting in:

$$
N = E M = \underbrace{\sum_{i, j = 1}^{q} e_{i} m_{j} \phi_{i + j}} + \underbrace{\sum_{i, j = 1}^{q} e_{i} m_{j} \phi_{| i - j |}} \tag{1}
$$
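The two sums in Equation (1) mirror the product rule $\phi_{i} \phi_{j} = \phi_{i + j} + \phi_{| i - j |}$. The following is a minimal sketch that checks this rule numerically, assuming the $\phi_{k}$ are Dickson polynomials of the first kind with parameter 1 over GF(2), built from $\phi_{0} = 0$, $\phi_{1} = x$, and $\phi_{k} = x \phi_{k - 1} + \phi_{k - 2}$. This definition and the function names are assumptions made for illustration; the source does not spell them out.

```python
def clmul(a, b):
    """Carry-less product of two GF(2)[x] polynomials stored as bitmasks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def dickson_polys(n):
    """Return [phi_0, ..., phi_n] over GF(2) as bitmasks (assumed recurrence).

    phi_0 = 2 = 0 in characteristic 2, phi_1 = x,
    phi_k = x * phi_{k-1} + phi_{k-2}.
    """
    phi = [0, 0b10]
    for k in range(2, n + 1):
        phi.append((phi[k - 1] << 1) ^ phi[k - 2])
    return phi


def product_rule_holds(limit):
    """Check phi_i * phi_j == phi_{i+j} + phi_{|i-j|} for 1 <= i, j <= limit."""
    phi = dickson_polys(2 * limit)
    return all(
        clmul(phi[i], phi[j]) == phi[i + j] ^ phi[abs(i - j)]
        for i in range(1, limit + 1)
        for j in range(1, limit + 1)
    )


if __name__ == "__main__":
    print(product_rule_holds(8))
```

Under this assumption, the terms with $i = j$ contribute $\phi_{0} = 0$ to the second sum, so they drop out.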

The expression given in Equation (1) can be transformed into a matrix representation, a crucial step for the precise development of a systolic multiplier. The resulting product $N$ is obtained through the computation of three separate matrix-vector multiplications [16,17,18,20], namely $N_{1}$, $N_{2}$, and $N_{3}$, as indicated in Equation (2).

\begin{matrix} N = \underbrace{[\begin{matrix} e_{q} & e_{q - 1} & e_{q - 2} & \dots & e_{1} \\ e_{1} & e_{q} & e_{q - 1} & \dots & e_{2} \\ e_{2} & e_{1} & e_{q} & \dots & e_{3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ e_{q - 1} & e_{q - 2} & e_{q - 3} & \dots & e_{q} \end{matrix}] \times [\begin{matrix} m_{1} \\ m_{2} \\ m_{3} \\ \vdots \\ m_{q} \end{matrix}]}_{N_{1}} \\ + \underbrace{[\begin{matrix} 0 & e_{1} & e_{2} & \dots & e_{q - 1} \\ 0 & 0 & e_{1} & \dots & e_{q - 2} \\ 0 & 0 & 0 & \dots & e_{q - 3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 0 \end{matrix}] \times [\begin{matrix} m_{1} \\ m_{2} \\ m_{3} \\ \vdots \\ m_{q} \end{matrix}]}_{N_{2}} \\ + \underbrace{[\begin{matrix} e_{q - 1} & 0 & e_{q - 1} & \dots & e_{2} \\ e_{q - 2} & e_{q - 1} & 0 & \dots & e_{3} \\ e_{q - 3} & e_{q - 2} & e_{q - 1} & \dots & e_{4} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & e_{1} & e_{2} & \dots & e_{q - 1} \end{matrix}] \times [\begin{matrix} m_{q} \\ m_{q - 1} \\ m_{q - 2} \\ \vdots \\ m_{1} \end{matrix}]}_{N_{3}} \end{matrix}

(2)

3. Systolic Design Methodology

The design of systolic architectures follows a structured methodology that transforms computational algorithms into efficient, parallel hardware implementations. The process begins with a careful analysis of the target algorithm—in this case, finite-field multiplication using the Dickson basis. The algorithm is first decomposed into fine-grained operations, exposing inherent parallelism and data dependencies. This step is crucial because it determines how computations can be distributed across processing elements (PEs) while maintaining high throughput and low latency. Next, the algorithm is represented as a dependency graph (DG), where nodes correspond to computational tasks and edges define the flow of data between them. For our multiplier, this means mapping the three matrix-vector products (

N 1

,

N 2

,

N 3

) into separate DGs. The DG captures the spatial and temporal relationships between operations, ensuring that computations are scheduled in a way that maximizes parallelism while respecting data dependencies.

The core innovation in systolic design lies in the space-time transformation, where the abstract DG is converted into a physical array of interconnected PEs. This involves two key steps:

Scheduling: Assigning each node in the DG to a specific clock cycle, ensuring that dependent operations execute in the correct order. This is controlled by a scheduling vector, which dictates the computation sequence.
Projection: Mapping multiple nodes onto a single PE to optimize hardware reuse. A projection matrix determines how nodes are merged, balancing resource efficiency with performance.

The result is a systolic array—a highly regular, pipelined structure where data flows rhythmically between PEs, much like blood circulating through the heart (hence the term “systolic”). In our design, this yields three linear arrays, each handling one matrix-vector product. The PEs are simple and enable scalable and modular hardware that is ideal for VLSI implementation.

4. Formulating Dependency Graphs

To clarify the operational interdependencies and structural patterns observed in Equation (2), we utilize visual representations through DGs. The DGs of the matrix-vector multiplications,

N 1

,

N 2

, and

N 3

, outlined in Equation (2), were derived by the authors in previous work [17] and are illustrated in Figure 1, Figure 2 and Figure 3. These matrix-vector multiplications follow the same computational pattern; the only variation is the order in which input values are presented to the processing nodes.

Figure 1 illustrates the DG for

N 1

, featuring: (1) input signals

m_{i}

, with i taking values from 1 to q, introduced from the left of the DG; (2) the uppermost portion of the DG serves as the initial input point for the zero entries of the

N 1

signals; (3) corner connectors positioned at the left extremities of the input nodes are utilized to combine the input signals

e_{j - 1}

and

e_{q - i + 1}

, where j takes values from 2 to q and i takes values from 1 to q; and (4) the ultimate output bits for

N 1

are produced at the bottom of the DG.

Figure 2 depicts the DG for

N 2

, highlighting: (1) input signals

m_{i}

, where i takes values from 1 to q, enter the DG from the left; (2) the topmost DG section is the point of entry for initial zero values of

N 2

; (3) angled lines at input node edges facilitate the introduction of initial zero inputs for signals

e_{j - 1}

and

e_{i - 1}

, where j takes values such that

2 \leq j \leq q

and i takes values such that

1 \leq i \leq q

; and (4) the final output bits for

N 2

are generated at the bottom of the DG.

Figure 3 showcases the DG for

N 3

, which includes: (1) input signals

m_{q - i + 1}

, where i takes values from 1 to q, are introduced via lateral flow from the DG’s left edge; (2) the top of the DG initializes

N 3

with zero values; (3) angled connectors on input nodes integrate coefficients of

e_{q - j}

and

e_{q - i}

, where i and j take values from 1 to q, with

e_{q}

consistently zero for accuracy; and (4) the final output bits for

N 3

are produced at the lower section of the DG.

The resulting outcome, N, is derived by synthesizing the results generated from the contributions of

N 1

,

N 2

, and

N 3

. This synthesis is essential for integrating the individual components into a complete result. To achieve this, binary addition is performed using two-input XOR gates, which facilitate the combination of the values.

5. Exploring Dickson-Based Systolic Multiplier Framework

This section details the development of the systolic multiplier framework, employing the methodology described in [16,17,41]. Central to this development are the scheduling and node projection techniques, which are applied to the DGs to obtain a parallel multiplier architecture with a well-defined operation sequence and streamlined data flow for the selected Dickson basis multiplication. Specifically, we apply this methodology to the DGs illustrated in Figure 1, Figure 2 and Figure 3. Notably, these DGs share an identical structure, differing only in the order of input coefficient presentation. Therefore, applying this methodology yields the same systolic array architecture for all three matrix-vector multiplications:

N 1

,

N 2

, and

N 3

.

The scheduling process is thoroughly detailed in a recent publication by the authors [17]. By applying this comprehensive methodology to each node of the DG, we can explore a range of suitable scheduling vectors

Z

that effectively assign time values to the nodes, optimizing their execution order. One particularly effective scheduling vector that leads to an efficient systolic array design, ensuring minimal latency and resource utilization, is [17]:

\begin{matrix} Z & = & [\begin{matrix} 1 & 0 \end{matrix}] \end{matrix}

(3)

The details of the projection process are also outlined in [17]. This approach is employed to map multiple nodes of the DG onto a single processing element (PE), thereby optimizing hardware reuse through the use of a projection matrix

T

. This matrix plays a crucial role in determining how nodes are combined, striking a balance between resource efficiency and performance. By following the outlined process, we can derive the following projection matrix

T

, which facilitates the creation of the parallel systolic structure with the desired characteristics of minimal latency and efficient resource utilization.

\begin{matrix} T & = & [\begin{matrix} 1 & - 1 \end{matrix}] \end{matrix}

(4)

Retrieving the Systolic Multiplier Design

As outlined in [17], the vectors

Z = [1 0]

and

T = [1 - 1]

are employed to compute the scheduling function

G (X)

and the projection function

\bar{X} (X)

for each node

X (i, j)

present in the DGs of Figure 1, Figure 2 and Figure 3. The derived functions,

G (X)

and

\bar{X} (X)

, are pivotal for optimizing system-wide performance, ensuring a correct sequence of operational execution and efficient resource utilization. These functions are subsequently formulated as:

\begin{matrix} G (X) = i \\ \bar{X} (X) = i - j \end{matrix}

(5)

Employing the explored functions

G (X) = i

and

\bar{X} (X) = i - j

on the DG nodes of Figure 1, Figure 2 and Figure 3 allows us to establish the execution timing, represented by

G (X)

, and the PE index, represented by

\bar{X} (X)

, for each node. Specifically,

G (X)

provides the clock cycle at which a node’s computation should begin, while

\bar{X} (X)

assigns each node to a particular PE within the systolic array. However, the chosen projection function,

\bar{X} (X) = i - j

, might produce negative indices for certain PEs, which are not physically realizable in a hardware implementation. To eliminate these negative indices and ensure all PEs have positive or zero indices, a nonlinear mapping operator can be employed, defined as:

\begin{matrix} \bar{X} (X) = (i - j) mod q \end{matrix}

(6)
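The effect of Equations (3) to (6) can be illustrated with a short script. The sketch below is illustrative only: the function names `schedule_and_project` and `is_conflict_free` are hypothetical, and the DG is assumed to be the full q × q grid of nodes X(i, j) with 1 ≤ i, j ≤ q. It assigns each node its clock cycle G(X) = i and its PE index (i - j) mod q, and then checks that no PE is asked to process two nodes in the same cycle.

```python
def schedule_and_project(q):
    """Return {(i, j): (time, pe)} for DG nodes X(i, j), 1 <= i, j <= q."""
    Z = (1, 0)    # scheduling vector, Equation (3)
    T = (1, -1)   # projection matrix (a single row), Equation (4)
    mapping = {}
    for i in range(1, q + 1):
        for j in range(1, q + 1):
            time = Z[0] * i + Z[1] * j        # G(X) = i, Equation (5)
            pe = (T[0] * i + T[1] * j) % q    # (i - j) mod q, Equation (6)
            mapping[(i, j)] = (time, pe)
    return mapping


def is_conflict_free(mapping):
    """True if no PE is assigned two nodes in the same clock cycle."""
    slots = list(mapping.values())
    return len(slots) == len(set(slots))


if __name__ == "__main__":
    q = 4  # small illustrative size
    mapping = schedule_and_project(q)
    for i in range(1, q + 1):
        print(f"cycle {i}:", [mapping[(i, j)][1] for j in range(1, q + 1)])
    print("conflict-free:", is_conflict_free(mapping))
```

Because Python's `%` operator returns a non-negative result for a positive modulus, the modular mapping directly yields PE indices in the range 0 to q - 1, and nodes on the same diagonal (equal i - j) share one PE.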

Due to the structural and dependency similarities across the three DGs representing the

N 1

,

N 2

, and

N 3

matrix-vector multiplications, a single, representative DG structure, as illustrated in Figure 4, is employed for assigning scheduling times and PEs’ indices. This unification streamlines the design process, simplifying both the analysis and subsequent hardware implementation. The shaded gray rectangles superimposed over each row of nodes within the DG visually represent the scheduling times, indicating the specific clock cycles at which the computations associated with those nodes are to be executed. These rectangles effectively depict the temporal dimension of the algorithm’s operation. Conversely, the pentagonal shapes, each corresponding to a distinct computational node, contain the respective PE indices. These indices indicate the specific physical PEs within the systolic array that are responsible for performing the computations associated with each node. Notably, it is observed that all nodes lying along the same diagonal within the DG are assigned to the same PE. This allocation reflects the projection function’s role in mapping multiple computational nodes onto a single PE, effectively consolidating operations that can be performed by the same hardware resource. This mapping strategy minimizes hardware redundancy and optimizes resource utilization. The total number of PEs required for the systolic array is q, where q corresponds to the dimension of the matrix-vector products. This linear array structure, comprising q PEs, is designed to efficiently process the data, ensuring high throughput and minimal latency, thereby enhancing the overall performance of the multiplier.

Figure 4 facilitates the visualization of a bit-parallel systolic multiplier, implemented as shown in Figure 5, which is specifically designed for high-throughput multiplication operations within finite fields. This structure, detailed in Figure 5, comprises three distinct one-dimensional systolic arrays, each dedicated to computing one of the partial products (

N 1

,

N 2

, and

N 3

) resulting from the Dickson basis multiplication. Each of these arrays consists of q PEs arranged linearly, enabling efficient data propagation and localized computations. This linear arrangement supports a pipelined data flow, where each PE performs a portion of the multiplication operation in parallel, contributing to the overall result. The bit-parallel nature of the design means that all bits of the operands are processed simultaneously within each PE, significantly enhancing computational speed and leading to substantial performance gains, particularly in applications requiring real-time processing of large data sets. By processing all bits concurrently, the systolic array minimizes latency and maximizes throughput, making it highly efficient for high-speed arithmetic operations. Furthermore, the regular and localized interconnections between PEs simplify the hardware implementation and reduce the complexity of routing signals, contributing to the overall efficiency and scalability of the design.

As depicted in Figure 5, the upper systolic array is dedicated to computing the product of the matrix-vector multiplication

N 1

(

n_{t}

), which contributes to the overall result of the Dickson basis multiplication. Conversely, the middle systolic array is responsible for computing the product of the matrix-vector multiplication

N 2

(

n_{i}

), while the lower systolic array is tasked with computing the product of the matrix-vector multiplication

N 3

(

n_{l}

), completing the set of partial products required for the final multiplication output.

The internal logic of the PEs within these systolic arrays is meticulously designed and detailed in Figure 6 and Figure 7. These PEs are engineered to execute the requisite computations to derive the partial products of the matrix-vector multiplications

N 1

,

N 2

, and

N 3

based on the input data they receive.

In Figure 5, it is evident that the initial values of the coefficient bits for

n_{t}

,

n_{i}

, and

n_{l}

are consistently zero. This observation presents an opportunity for significant optimizations in terms of both area and delay complexities within the systolic multiplier design. Specifically, by strategically resetting the

D_{n}

flip-flops, as depicted in Figure 6 and Figure 7, at the precise moment when these initial zero values are required, we can streamline the computational process. This resetting action directly presents the necessary zero values at the inputs of the XOR gates, effectively circumventing the need for additional logic circuitry that would otherwise be required to compute or generate these zero values. This optimization not only reduces the overall hardware footprint of the multiplier but also minimizes the propagation delay, thereby enhancing the operational speed and efficiency of the system. By leveraging the inherent characteristics of the initial conditions, we can achieve a more compact and faster systolic multiplier implementation.

When comparing the current work to the authors’ previous research [16,17], several critical differences become evident. The systolic array presented in [16] operates as a hybrid design that processes the three matrix-vector products sequentially, resulting in a longer latency of

q + 2

clock cycles. This approach effectively triples the computation time compared to our current fully parallel architecture, which achieves a latency of just q cycles. This transition not only improves throughput but also enhances overall efficiency, enabling faster calculations in practical scenarios. Moreover, the processing elements in [16] feature fundamentally different structures and interconnection patterns, particularly in how signals are routed. In contrast, our current design employs a more streamlined architecture that boosts signal integrity and reduces propagation delays.

In relation to the findings in [17], the current implementation introduces significant architectural advancements. Notably, diagonal signals e are directly assigned to each processing element, removing the need for pipelining through latches. This change simplifies the signal flow and decreases latency. Additionally, the horizontal signal m is broadcast to all elements without intermediate latching, which further optimizes performance. Output signals n are pipelined between elements rather than being accumulated locally using latches, enhancing data throughput. These modifications reduce the number of latches by 50% while keeping the number of AND/XOR gates the same. This not only decreases chip area but also reduces power consumption, leading to a more energy-efficient design. As a result, the operational behavior of the current architecture differs significantly due to these altered signal flows and interconnects, facilitating improved performance metrics across various applications.

The presented systolic multiplier design also offers a significant reduction in hardware footprint compared to previously published two-dimensional parallel systolic architectures. These conventional designs typically require a hardware area that scales quadratically with the field size, specifically

O (q^{2})

, where q denotes the field dimension. In contrast, the current design achieves a linear scaling,

O (q)

, resulting in notably more efficient hardware resource utilization. This linear scaling is particularly advantageous for implementations targeting environments with limited hardware resources, such as embedded systems and IoT devices.

When directly compared to Montgomery and Dickson two-dimensional parallel systolic architectures, as documented in [18,19,20,38,39], the proposed multiplier demonstrates a clear advantage in spatial efficiency. The reduced hardware area directly contributes to improved hardware resource utilization, rendering the proposed multiplier superior in terms of area efficiency. Moreover, when assessed against parallel multipliers based on standard finite field arithmetic techniques, as detailed in [7,35,36,42,43], the proposed multiplier exhibits a decrease in hardware area and potential enhancements in power consumption. The specific performance data, including detailed comparisons of hardware area, power dissipation, throughput, and latency, will be presented in the results section. These data will demonstrate the advantages of the proposed multiplier in terms of both hardware area and power efficiency, emphasizing its suitability for resource-constrained IoT applications. The linear systolic array configuration also simplifies the control circuitry and minimizes signal routing complexity, further enhancing its overall efficiency.

The architecture of the developed parallel systolic multiplier, as derived from Figure 5, is meticulously designed to optimize performance and resource utilization. The layout and signal flow can be detailed as follows:

1.

Input Signals Allocation:

Input signals $e_{j - 1}$ , 0, and $e_{q - j}$ , where $1 \leq j \leq q$ , are specifically assigned to the input port $e_{s}$ in every PE across the upper, middle, and lower systolic arrays, respectively. This assignment is visually represented in Figure 5, Figure 6 and Figure 7.
Input signals $e_{i - 1}$ and $e_{q - i}$ , where $1 \leq i \leq q$ , are directed to the input port $e_{f}$ in the regular PEs ( ${PE}_{b}$ ) of the middle and lower systolic arrays, as seen in Figure 5 and Figure 7.
Input signals $e_{q - i + 1}$ in the upper systolic array coincide with the input signals $e_{j - 1}$ . Consequently, there is no necessity for an additional input port. Therefore, the input port $e_{f}$ is omitted from the PEs of the upper systolic array ( ${PE}_{t}$ ). This simplification reduces the complexity of the PEs in the upper array compared to the regular PEs ( ${PE}_{b}$ ) in the middle and lower arrays, as illustrated in Figure 6 and Figure 7.

2.

Sequential Input Signals:

Input signals $m_{i}$ , where $1 \leq i \leq q$ , are fed sequentially into the first processing element ( ${PE}_{0}$ ) of the upper and middle systolic arrays. These signals then propagate through all the PEs within their respective arrays.
Similarly, the input signals $m_{q - i + 1}$ , where $1 \leq i \leq q$ , are sequentially input into the first processing element ( ${PE}_{0}$ ) of the lower systolic array and subsequently pass through all its regular PEs.

3.

Intermediate Signal Handling:

The control signal g is introduced at the second processing element ( ${PE}_{1}$ ) of the middle and lower systolic arrays. This signal is then pipelined through the regular PEs ( ${PE}_{b}$ ) of these arrays to activate the lower tri-state buffer, as depicted in Figure 7. This mechanism ensures the accurate and timely assignment of signals to the port $e_{f}$ , maintaining precise timing and control within the systolic array.
The intermediate values of the bit n are pipelined between the PEs of all three systolic arrays, as shown in Figure 5, Figure 6 and Figure 7. This pipelining facilitates the computation of the final coefficient bits $n_{t}$ , $n_{i}$ , and $n_{l}$ .

4.

Parallel Output:

The resulting coefficient bits $n_{t}$ , $n_{i}$ , and $n_{l}$ from the upper, middle, and lower systolic arrays, respectively, are accessible concurrently at the outputs of their respective PEs after q clock cycles.

5.

Final Product Calculation:

The final product bits $n_{j}$ , where $1 \leq j \leq q$ , are obtained at clock cycle q by performing a bitwise XOR operation (using two-input XOR gates) on the corresponding bits of $n_{t}$ , $n_{i}$ , and $n_{l}$ , as illustrated in Figure 5. This final step completes the multiplication process, providing the desired product output.

The operational sequence of the analyzed bit-parallel systolic multiplier structure is precisely orchestrated to ensure efficient and accurate computation. Here is a detailed breakdown of the process:

1.

Initialization Phase:

During the initial clock cycle, a crucial reset operation is performed. The $D_{n}$ latches, which are integral to the PEs as depicted in Figure 6 and Figure 7, are reset. This action forces the coefficient bits, denoted as n, to assume a zero value, establishing the initial state for the computation.
Simultaneously, the control signal g is deactivated, setting it to a logical low ( $g = 0$ ). This deactivation enables the input signals assigned to the port $e_{s}$ ( $e_{j - 1}$ , 0, and $e_{q - j}$ , where $1 \leq j \leq q$ ) to pass unimpeded through the upper tri-state buffer, as illustrated in Figure 7. These signals are then correctly allocated to their respective PEs within the systolic array layout.
Concurrently, the first bits of the input signals $m_{i}$ and $m_{q - i + 1}$ ( $1 \leq i \leq q$ ) are sequentially fed into the first processing element ( ${PE}_{0}$ ) of each corresponding systolic array through the input port m, as shown in Figure 5. These signals, once received at ${PE}_{0}$ , are directly broadcast to all subsequent PEs within each systolic array, ensuring uniform data distribution.

2.

Computation Phase:

Starting from the second clock cycle and extending through the $q^{\text{th}}$ clock cycle, the control signal g is activated, setting it to a logical high ( $g = 1$ ). This activation enables the intermediate signals $e_{f}$ to be allocated to the regular processing elements ( ${PE}_{b}$ ) for the computation of the intermediate values of the signals designated for port n, as depicted in Figure 5. This process is central to the iterative computation of the multiplication result.
During these clock cycles, the subsequent components of the input signals $m_{i}$ and $m_{q - i + 1}$ ( $1 \leq i \leq q$ ) are sequentially introduced into the first processing element ( ${PE}_{0}$ ) of each corresponding systolic array through the input port m, as shown in Figure 5. As in the initialization phase, these signals are directly broadcast to all subsequent PEs within each systolic array, maintaining data flow and parallelism.

3.

Parallel Output Phase:

At the $q^{\text{th}}$ clock cycle, the final result n yields its ultimate parallel output coefficient bits, indicated as $n_{j}$ (where $1 \leq j \leq q$ ). These bits are produced simultaneously from the last row of XOR gates, as illustrated in Figure 5. This simultaneous generation of output bits signifies the completion of the multiplication operation, delivering the final product in a parallel format.
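The operational sequence above can be mimicked with a behavioral model of a single array. The sketch below is a simplified illustration and not the gate-level PE of Figure 6 and Figure 7: it models only the upper array, it assumes the PEs are connected in a ring so that partial sums wrap from the last PE back to the first (following the modular PE index of Equation (6)), and it omits the tri-state buffers and the control signal g. The function names `n1_direct` and `n1_systolic` are hypothetical. Each PE holds one fixed coefficient of e, one bit of m is broadcast per clock cycle, and the partial sums n move to the neighboring PE every cycle.

```python
def n1_direct(e, m):
    """Reference: the first matrix-vector product of Equation (2) over GF(2).

    e[k-1] and m[k-1] hold e_k and m_k; entry (r, c) of the matrix is
    e_{(r-c) mod q}, where index 0 is read as q.
    """
    q = len(e)
    out = []
    for r in range(1, q + 1):
        bit = 0
        for c in range(1, q + 1):
            bit ^= e[((r - c) % q) - 1] & m[c - 1]
        out.append(bit)
    return out


def n1_systolic(e, m):
    """Behavioral model of a q-PE array computing the same product in q cycles."""
    q = len(e)
    # PE p keeps one fixed coefficient: e_q for p = 0, otherwise e_{q-p}.
    coef = [e[((-p) % q) - 1] for p in range(q)]
    regs = [0] * q                      # partial-sum registers, reset to zero
    for t in range(1, q + 1):           # one broadcast bit m_t per clock cycle
        m_t = m[t - 1]
        # Each PE adds its AND term to the partial sum from its neighbor.
        regs = [regs[(p - 1) % q] ^ (coef[p] & m_t) for p in range(q)]
    # After q cycles the sum for output bit n_j sits in PE (q - j) mod q.
    return [regs[(q - j) % q] for j in range(1, q + 1)]


if __name__ == "__main__":
    e = [1, 0, 1, 1]   # arbitrary illustrative operands, q = 4
    m = [0, 1, 1, 0]
    print(n1_systolic(e, m) == n1_direct(e, m))  # True
```

In this model all q output bits become available together after q cycles, which matches the latency stated for the parallel output phase.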

6. Results Summary and Insights

This section presents a comprehensive comparative analysis, systematically evaluating the investigated systolic multiplication architecture against a variety of prominent systolic and semi-systolic multiplication designs documented in the literature [19,20,38,39,42]. The analysis is structured into two distinct phases to provide a holistic understanding of the proposed framework’s performance and practicality.

The first phase focuses on a detailed comparison of the hardware footprint and computational latency between the proposed architecture and existing designs. By meticulously examining these parameters, we aim to elucidate the inherent trade-offs between hardware resource utilization and operational speed. This analysis provides critical insights into the efficiency and scalability of the proposed framework relative to state-of-the-art alternatives.

The second phase shifts from theoretical analysis to empirical validation, aiming to verify our complexity assessments through practical implementation. Specifically, the proposed architecture is realized as an ASIC implementation. This step allows us to measure its actual performance metrics and compare them against the theoretically predicted computational requirements. Such empirical validation is critical to ensure that our analytical findings accurately reflect the architecture’s behavior in real-world scenarios, thereby enhancing the reliability and credibility of our conclusions. Together, these two phases provide a robust and holistic evaluation of the proposed systolic multiplication architecture, bridging the gap between theoretical insights and practical application.

6.1. Complexity Analysis

Examining the systolic architecture depicted in Figure 5, with attention to its data flow and inter-PE communication, shows that the design contains a total of

3 q

PEs, organized as three linear arrays of q PEs each to support pipelined data processing. Together, these PEs comprise

3 q

two-input AND gates,

3 q

two-input XOR gates, no multiplexers (MUXes), and

3 q

storage elements (latches), which synchronize data transfer between PEs. These elements operate under precisely timed control signals to perform the bit-level arithmetic operations within each PE. The absence of multiplexers reflects a fixed data flow, which simplifies the control logic and reduces hardware overhead.

To derive the final result bits, denoted as

n_{j}

, an additional set of

2 q

two-input XOR gates is employed. These gates add, bit by bit, the corresponding bits of

n_{t}

,

n_{i}

, and

n_{l}

, consolidating the three intermediate results. Consequently, the total number of XOR gates in the architecture is

5 q

.

To assess the operational speed of the proposed multiplication unit, the critical path delay (CPD) must be determined. The CPD is the longest sequence of logic operations within the circuit, from input to output, and it sets the upper bound on the achievable clock frequency and hence on the maximum data processing rate. In this architecture, the CPD consists of two two-input XOR gates in series, namely those in the final stage of result accumulation, giving a total delay of

2 Δ_{X}

, where

Δ_{X}

is the propagation delay of a single two-input XOR gate.

A functional analysis of the multiplier’s operational sequence reveals that the proposed architecture attains its final output product within a temporal span of q clock cycles. This implies that the entire computational process, from the initiation of the multiplication operation to the generation of the complete result bit-stream, is executed within q discrete clock intervals. This temporal characteristic is of paramount importance for the determination of the multiplier’s performance, thereby facilitating a comprehensive assessment of its suitability for deployment in time-critical applications.
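The counts derived in this subsection can be collected in a small helper for quick what-if calculations. This is only a bookkeeping sketch: the names `Complexity` and `proposed_complexity` are hypothetical, the figures are the gate-level totals stated above, and the value q = 8 used in the example is an arbitrary illustration rather than a field size evaluated in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complexity:
    and_gates: int
    xor_gates: int
    muxes: int
    latches: int
    latency_cycles: int
    cpd_in_xor_delays: int   # critical path delay in units of one XOR delay


def proposed_complexity(q):
    """Gate, latch, latency, and CPD figures stated in the text for field size q."""
    pe_xor = 3 * q    # XOR gates inside the 3q PEs
    sum_xor = 2 * q   # XOR gates that add n_t, n_i, and n_l bit by bit
    return Complexity(
        and_gates=3 * q,
        xor_gates=pe_xor + sum_xor,
        muxes=0,
        latches=3 * q,
        latency_cycles=q,
        cpd_in_xor_delays=2,
    )


if __name__ == "__main__":
    print(proposed_complexity(8))
```

All counts grow linearly in q, whereas the latency in clock cycles equals q and the CPD stays constant at two XOR delays.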

Table 1 presents a comprehensive comparative analysis, contrasting the proposed systolic multiplication unit with a range of established parallel systolic and semi-systolic multiplier architectures from the literature [19,20,38,39,42]. The comparison focuses on three key parameters: architectural composition, operational latency, and CPD. The architectural composition is evaluated in terms of the number of logical gates, data selectors (MUXes), and storage elements (latches), providing insight into the hardware complexity and resource utilization of each design. Operational latency, measured in clock cycles, reflects the time required to complete the multiplication process, highlighting the efficiency of each architecture in handling computational tasks. Finally, the critical path delay (CPD) determines the maximum achievable clock frequency, which directly impacts the overall performance and suitability of the multiplier for high-speed applications.

Through a structured and methodical examination of these critical parameters, Table 1 facilitates a transparent and granular comparison, enabling a rigorous assessment of the proposed architecture’s merits and limitations relative to established methodologies. This analysis not only highlights the inherent strengths of the presented multiplication unit—such as a reduced hardware footprint, improved latency, or enhanced critical path efficiency—but also identifies potential areas for future optimization. By doing so, it contributes to a more nuanced understanding of the architecture’s suitability for deployment in real-world, resource-constrained, and performance-sensitive applications.

The data presented in the table provides valuable insights for designers, allowing them to make informed decisions when selecting a multiplier architecture. Key considerations include area efficiency, operational speed, and compatibility with broader system integration requirements. For instance, a design with fewer logical gates and storage elements may offer advantages in resource-constrained environments, while a lower CPD can significantly enhance performance in high-speed applications. By systematically evaluating these factors, the table serves as a practical tool for balancing trade-offs and aligning architectural choices with specific application demands, ultimately supporting the development of efficient and scalable multiplier designs.

The comparative results reveal a distinct disparity in hardware resource utilization between the presented architectural framework and the previously documented designs. Specifically, the multiplier topologies referenced in the existing literature exhibit a quadratic growth in hardware requirements, denoted as

O (q^{2})

, indicating that their constituent component count scales quadratically with respect to the input size q. Conversely, the proposed systolic multiplier demonstrates a superior linear scaling of hardware resources, represented as

O (q)

, signifying a substantial reduction in hardware footprint. This resource efficiency is of paramount importance for deployment in smart city Internet of things (IoT) ecosystems, where limitations on both hardware resources and physical deployment space are prevalent constraints.

Furthermore, the comparative study indicates that all examined designs possess an equivalent asymptotic computational time complexity of

O (q)

. This uniformity in temporal performance signifies that the proposed systolic multiplier achieves computational throughput comparable to existing designs, while concurrently demanding significantly fewer hardware resources. This equilibrium between computational performance and resource economy renders the proposed architecture a highly desirable solution for practical implementation, particularly in contexts where efficient resource allocation and management are critical design considerations.

The investigated systolic multiplication architecture exhibits a suite of advantageous characteristics, thereby broadening its suitability for deployment within smart city IoT ecosystems. A principal benefit lies in its minimized spatial footprint, achieved through a compact architectural design that effectively curtails area requirements and optimizes the utilization of available hardware resources. This judicious use of physical space not only facilitates a reduction in the system’s overall dimensions but also yields positive ramifications for key performance indicators.

Consequently, the spatial efficiency of the proposed design directly translates to improvements in both the area-delay product (ADP) and the power-delay product (PDP). These improvements appear as better overall performance metrics and higher energy efficiency, rendering the proposed architectural arrangement a particularly compelling solution for smart city IoT applications characterized by stringent resource limitations and critical power consumption considerations.

The merits of the offered multiplier’s architectural design are corroborated by the empirical implementation data presented in Table 2. These experimental findings validate the theoretical claims regarding the reduction in hardware resource utilization, as well as the improvements observed in both the ADP and the PDP.

The achievement of diminished resource consumption, without a concurrent degradation in operational performance, translates to substantial practical advantages. This characteristic is especially salient in the context of smart city IoT deployments, where factors such as energy expenditure, efficient allocation of physical space, and overall operational effectiveness are of critical importance.

6.2. Implementation Findings

The offered systolic multiplication architecture underwent a rigorous comparative evaluation against established systolic and semi-systolic multiplier implementations [19,20,38,39,42], employing a comprehensive methodology. The development and instantiation of the diverse multiplier configurations were executed using the VHDL hardware description language, enabling precise representation of hardware behavior. For the synthesis stage, the Synopsys Design Compiler was employed, coupled with the Nangate 15 nm Open Cell Library operating at 0.8 V. This library was selected for its capability to deliver accurate estimations of hardware area, propagation delay, and power dissipation at a detailed gate-level resolution. The use of the Nangate library also allowed for a more realistic assessment of the design’s performance metrics, incorporating considerations for process variations and temperature effects, which are critical in modern VLSI design. Additionally, the chosen voltage level of 0.8 V aligns with low-power design strategies, optimizing energy efficiency while maintaining operational reliability.

To ensure that the designs met their functional specifications, an extensive validation process was conducted using ModelSim’s simulation tools. This phase involved the creation of detailed testbenches designed to evaluate a broad spectrum of operational scenarios, including edge cases and corner conditions, thereby ensuring the reliability and accuracy of the outputs under varying circumstances. The thoroughness of this verification process was critical for identifying and addressing potential issues early in the design cycle. By simulating different input combinations and operational states, the team was able to detect discrepancies and rectify them before moving forward. By allowing only those designs that successfully passed all functional tests to advance to the synthesis stage, this approach streamlined the development workflow. This not only enhanced efficiency but also significantly improved the overall quality of the final implementation. Such rigorous validation minimized the risk of late-stage design modifications, ultimately saving time and resources while ensuring that the final product met both performance and reliability standards.

The synthesis phase is a pivotal component of the hardware development lifecycle, where the VHDL code for each multiplier design is converted into a gate-level netlist using the Synopsys Design Compiler. This transformation effectively bridges the gap between abstract design specifications and a concrete representation that can be physically realized, thereby confirming the feasibility of the designs for hardware fabrication. The resulting gate-level netlists offer an in-depth perspective of the logical architecture, facilitating additional optimization and analysis in the later stages of the design process. This compiler leverages the Nangate library, which provides crucial technology-specific data, including gate dimensions, interconnect delays, and power characteristics. These parameters are essential for achieving precise and efficient synthesis, particularly within the challenging context of the 15 nm technology node, where refined geometries and advanced manufacturing techniques present unique challenges and opportunities for enhancement. Throughout this critical phase, the Design Compiler meticulously refines the netlist to adhere to established constraints such as area, timing, and power efficiency. This optimization employs sophisticated algorithms that modify the logical architecture of the design while maintaining its intended functionality. By ensuring that the final implementation meets performance targets and complies with design specifications, the synthesis phase plays a vital role in elevating the overall quality and reliability of the design. Ultimately, this stage is integral to achieving a successful hardware implementation.

By optimizing the logical structure and reducing resource utilization, the synthesis phase significantly enhances the functionality, efficiency, and dependability of the multiplier designs. Serving as a crucial link between theoretical design and tangible hardware realization, this phase lays the groundwork for effective physical implementation by generating detailed gate-level netlists that reflect the actual hardware configuration. The insights gained during synthesis, such as timing characteristics and power consumption metrics, provide valuable feedback for future design cycles, fostering a process of ongoing enhancement. This iterative methodology not only improves the current design but also strengthens the overall development process, ensuring that each new version benefits from the lessons learned in previous stages. By continuously refining the architecture, designers can drive innovation in performance, power efficiency, and resource utilization, ultimately leading to more robust and effective designs.

Once the synthesis process reaches completion, vital performance metrics—including area, delay, and power usage—are meticulously extracted from the synthesized netlists. This extraction is not merely a formality; it serves as a gateway to understanding each design’s operational efficiency. By thoroughly assessing these metrics, engineers can gain significant insights that facilitate a detailed comparison, effectively highlighting the strengths and weaknesses inherent in various multiplier configurations when applied in real-world scenarios. This comprehensive evaluation enables designers to assess how well each architecture performs in terms of resource utilization, speed, and energy efficiency, offering a clear understanding of their applicability in real-world systems. By leveraging these insights, developers can refine their designs, address potential shortcomings, and enhance the overall performance of future multiplier implementations.

The derived post-implementation characteristics of the innovated systolic multiplication unit, juxtaposed against established multiplier constructions [19,20,38,39,42], are documented in Table 2 for a finite field dimension of $q = 163$. This table encapsulates key operational parameters, including latency, critical path delay (CPD), area (A), multiplication delay (D), power dissipation (P), area-delay product (ADP), and power-delay product (PDP), which represents energy (E), all extracted from the implementation reports. In addition, this table includes the energy associated with processing each bit (E/bit). We observe that the two multiplier designs mentioned in [38,39] exhibit the lowest latency, while the proposed design consumes the least energy per processed bit.
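The derived metrics listed above follow from the three primary quantities by simple products. The sketch below shows one way to compute them, assuming that the multiplication delay is the latency in clock cycles multiplied by the CPD and that E/bit is the PDP divided by the field size q; both relations are assumptions about how the table was built, not statements from the implementation reports. The class name, the units, and the numeric inputs are hypothetical placeholders and are not values from Table 2.

```python
from dataclasses import dataclass


def multiplication_delay(latency_cycles: int, cpd_ns: float) -> float:
    """Assumed relation: total delay D = latency (cycles) x CPD (ns)."""
    return latency_cycles * cpd_ns


@dataclass(frozen=True)
class SynthesisResult:
    area_um2: float   # A, silicon area (unit is an assumption)
    delay_ns: float   # D, total multiplication delay
    power_mw: float   # P, power dissipation

    @property
    def adp(self) -> float:
        """Area-delay product, A x D."""
        return self.area_um2 * self.delay_ns

    @property
    def pdp(self) -> float:
        """Power-delay product, P x D; mW x ns gives energy in pJ."""
        return self.power_mw * self.delay_ns

    def energy_per_bit(self, q: int) -> float:
        """Assumed relation: E/bit = PDP / q."""
        return self.pdp / q


if __name__ == "__main__":
    # Placeholder inputs for illustration only, not taken from Table 2.
    d = multiplication_delay(latency_cycles=163, cpd_ns=0.05)
    r = SynthesisResult(area_um2=1000.0, delay_ns=d, power_mw=2.0)
    print(f"D={r.delay_ns:.2f} ns  ADP={r.adp:.1f}  PDP={r.pdp:.2f} pJ  "
          f"E/bit={r.energy_per_bit(163):.3f} pJ")
```

Because both ADP and PDP include the delay as a factor, a design with a somewhat longer delay can still achieve lower ADP and PDP when its area and power are reduced by a larger factor.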

Visual comparisons of the other metrics are provided in Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12, which feature logarithmic-scaled bar graphs comparing the silicon area (A), power consumption (P), delay (D), ADP, and PDP of the innovated single-flow multiplier against rival designs. These graphical illustrations effectively convey the performance trade-offs and benefits of the novel design, offering both quantitative and qualitative observations regarding its operational attributes across a range of performance criteria. By presenting these results, the analysis highlights the advancements achieved with the new design and underscores its competitive advantages in various aspects of performance.

A thorough analysis of the data presented in Table 2 and illustrated in Figure 8 and Figure 9 reveals that the proposed systolic multiplier achieves significant improvements in both area efficiency and power consumption compared to existing designs. The reduction in area utilization is particularly noteworthy, with decreases ranging from 99.6% to 99.8%, highlighting a dramatic contraction of the hardware footprint. This substantial reduction can be attributed to an innovative design strategy that effectively minimizes redundant circuitry while optimizing the arrangement of components to maximize spatial efficiency.

Similarly, the advancements in power consumption are equally impressive, showing reductions between 94.2% and 96.9%. This notable decrease in power usage stems largely from the architecture’s capability to execute computations with fewer clock cycles and diminished switching activity, which directly translates into significant energy savings. Additionally, the proposed design incorporates advanced techniques such as clock gating and data path optimization, further bolstering its overall energy efficiency.
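The percentage reductions quoted for area and power can be read as relative savings with respect to each competing design. The following sketch states that calculation explicitly, assuming the usual definition (reference minus proposed, divided by reference); the function names and the input values are hypothetical and are not taken from Table 2.

```python
def percent_reduction(reference: float, proposed: float) -> float:
    """Relative saving of `proposed` with respect to `reference`, in percent."""
    if reference <= 0:
        raise ValueError("reference value must be positive")
    return 100.0 * (reference - proposed) / reference


def size_ratio(reduction_percent: float) -> float:
    """How many times larger the reference is, given a percentage reduction."""
    if not 0.0 <= reduction_percent < 100.0:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


if __name__ == "__main__":
    # Placeholder values: a reference design 250 times larger than the
    # proposed one corresponds to a 99.6% reduction.
    print(f"{percent_reduction(reference=250.0, proposed=1.0):.1f}%")
    print(f"{size_ratio(99.6):.0f}x")
```

Read this way, a reduction close to 100% corresponds to a ratio of two or more orders of magnitude between the compared values, which is consistent with the use of logarithmic axes in the bar graphs.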

These findings underscore a significant advancement in energy efficiency, establishing the offered design as exceptionally suitable for applications where reducing resource usage and power consumption is crucial. Such enhancements render the architecture particularly advantageous for implementation in resource-constrained IoT applications within smart cities, where the efficient utilization of hardware and energy resources is vital for optimal performance and sustainability. In these scenarios, maintaining high performance while concurrently minimizing area and power requirements can yield substantial benefits, including extended battery life, lower cooling costs, and enhanced overall system reliability. The ability to operate efficiently in such environments not only contributes to the longevity of devices but also supports the overarching goal of creating sustainable urban ecosystems. By integrating this innovative design, smart city applications can achieve a balance between technological advancement and environmental responsibility, paving the way for smarter, more resilient infrastructures.

A notable trade-off of the proposed architecture is a slightly increased delay compared to certain established architectures, as depicted in Figure 10. This increase is primarily attributable to the modestly higher operational latency and the larger critical path delay (CPD) intrinsic to the proposed configuration. The CPD, which denotes the longest signal propagation path within the arithmetic circuit, significantly influences the overall operational speed of the design. Despite this minor increase in delay, the proposed architecture continues to deliver computational performance comparable to alternative methodologies, as evidenced by its equivalent asymptotic time complexity. It therefore remains effective for a diverse array of practical implementations in which performance requirements are balanced against other important design factors, such as area efficiency and power consumption.

The architecture’s ability to deliver competitive performance while optimizing resource utilization makes it a viable choice for applications where trade-offs between speed and resource allocation are carefully managed. Moreover, the extended delay does not significantly impact the overall performance in many use cases, particularly in applications where processing speed is less critical than power efficiency and area constraints. In scenarios such as low-power IoT devices, the benefits of reduced area and power consumption may outweigh the drawbacks of increased latency. Thus, the design remains practical, allowing it to excel in environments where resource management is paramount, while still providing satisfactory throughput for the intended computational tasks.

Analyzing Table 2 and Figure 11 and Figure 12, it becomes evident that the proposed systolic multiplier unit offers significant advantages concerning the area-delay product (ADP) and power-delay product (PDP), which are essential design metrics that illustrate the interplay between silicon footprint, signal latency, and energy dissipation. Notably, this design reveals impressive reductions in ADP, with values ranging from 99.5% to 99.9% when compared to existing alternatives. Such a reduction not only reflects a substantial enhancement in performance but also demonstrates an exceptionally efficient use of hardware resources, indicating that the architecture is finely tuned for minimal area while maintaining robust operational capabilities. Furthermore, the advancements in PDP are equally striking, showcasing reductions that span from 92.8% to 98.8%. These remarkable figures underscore the architecture’s superior energy efficiency, highlighting its ability to execute computations with significantly reduced energy expenditure. This level of efficiency is particularly critical in the context of smart city IoT applications, where the longevity of battery life and power consumption are of utmost importance. By successfully minimizing both area and power requirements, this innovative design emerges as a promising solution for fostering sustainable and efficient technologies within modern urban landscapes, ensuring that high performance is achieved without compromising resource conservation.

In synthesizing the preceding analysis, the proposed multiplication architecture demonstrates a compelling integration of efficient hardware resource utilization and energy efficiency, achieving delays comparable to alternative methods. Notably, it achieves significant reductions in ADP and PDP, which reflect optimized resource allocation and energy consumption—factors that are critical for resource-constrained IoT ecosystems in smart cities that demand extended operational lifespans. By minimizing resource requirements, this design facilitates the integration of cryptographic algorithms within encryption processors in IoT devices, which is essential for ensuring secure data transmission in environments characterized by high vulnerability. The architecture’s capacity to perform complex computations with low power consumption enables devices to function reliably over prolonged periods, particularly in battery-operated applications. Furthermore, its scalability permits deployment across a wide array of IoT applications, ranging from smart traffic management systems to environmental monitoring, thus enhancing real-time decision-making capabilities. This inherent flexibility not only bolsters urban development initiatives but also contributes to the overall resilience of smart cities in adapting to dynamic conditions. Ultimately, the architecture’s efficiency, security, and adaptability position it as a transformative solution for next-generation IoT applications in smart cities, which require rapid responses to evolving conditions and user needs. This advancement paves the way for innovative applications that enhance safety, accessibility, and quality of life in urban environments, including improved emergency response systems and personalized public services.

7. Security Analysis and Countermeasures

The proposed systolic array architecture presents a unique combination of inherent security advantages and specific vulnerabilities that must be carefully addressed for cryptographic applications. The design’s regular structure and deterministic operation naturally provide resistance to certain types of side-channel attacks. Its strict clock synchronization and fixed q-cycle latency eliminate data-dependent timing variations, while continuous processing element activation and distributed computation help obscure power signatures that could reveal sensitive information. The regular PE design and linear data flow further facilitate the implementation of systematic security enhancements across the entire array, making it particularly suitable for integrating error-detection mechanisms and masking techniques in future implementations.
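The idea of a fixed, data-independent cycle count can be illustrated in software. The sketch below is a generic MSB-first shift-and-add multiplier in polynomial basis, not the Dickson basis systolic array described in this article; it is swapped in only because it is compact. It always runs exactly m iterations and selects operands with bit masks instead of branches. The function name is hypothetical, and Python integer arithmetic is not itself constant-time, so the code models the fixed iteration count rather than actual hardware timing.

```python
def gf2m_mul_fixed_cycles(a: int, b: int, m: int, poly: int) -> int:
    """Multiply a and b in GF(2^m), always using exactly m iterations.

    `poly` is the irreducible polynomial including its x^m term,
    e.g. 0x11B for x^8 + x^4 + x^3 + x + 1.
    """
    acc = 0
    for i in reversed(range(m)):      # fixed trip count, independent of data
        acc <<= 1                     # multiply the accumulator by x
        carry = (acc >> m) & 1        # 1 if the degree reached m
        acc ^= poly & -carry          # branch-free conditional reduction
        bit = (b >> i) & 1            # next multiplier bit, MSB first
        acc ^= a & -bit               # branch-free conditional addition
    return acc


if __name__ == "__main__":
    # Well-known GF(2^8) example with the AES polynomial: {57} x {83} = {c1}.
    print(hex(gf2m_mul_fixed_cycles(0x57, 0x83, 8, 0x11B)))  # 0xc1
```

In this model every multiplication performs the same sequence of operations regardless of operand values, which mirrors the property attributed above to the fixed q-cycle latency.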

However, several potential vulnerabilities require mitigation strategies. The architecture’s regularity may produce identifiable electromagnetic patterns, particularly during the final accumulation stage, while the initialization phase exhibits data-dependent power consumption that could be exploited. The predictable computation flow also creates potential targets for fault injection attacks. These vulnerabilities are particularly relevant for IoT deployments where physical access to devices may be possible. It should be noted that none of the prior works [19,20,38,39,42] included in our performance comparisons address these security aspects, as they focus exclusively on computational efficiency metrics. Therefore, any direct security comparison would be fundamentally imbalanced, as our architecture is specifically designed to facilitate future security enhancements while maintaining its efficiency advantages.

To address these concerns while maintaining computational efficiency, we propose a multi-layered security approach for future implementation. At the circuit level, lightweight first-order masking schemes and randomized operation scheduling can be implemented with minimal overhead. Architectural enhancements will include dynamic basis transformation and operand permutation to break predictable patterns, along with parity-based error detection circuits that leverage the array’s regularity. The systolic structure’s uniformity actually simplifies these implementations compared to less regular architectures, as security operations can be inserted systematically throughout the computation pipeline without disrupting the core operations.
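As a rough illustration of what first-order masking of a field multiplication involves, the sketch below applies the widely used two-share Boolean masking construction (in the style of Ishai, Sahai, and Wagner) to a generic polynomial basis GF(2^m) multiplier. It is not the authors' planned countermeasure and does not use the Dickson basis; all function names are hypothetical, and a software model like this says nothing about leakage of a real circuit.

```python
import secrets


def gf_mul(a: int, b: int, m: int, poly: int) -> int:
    """Plain shift-and-add multiplication in GF(2^m); poly includes x^m."""
    acc = 0
    for _ in range(m):
        if b & 1:
            acc ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return acc


def share(x: int, m: int) -> tuple[int, int]:
    """Split x into two random shares with x = s0 XOR s1."""
    s0 = secrets.randbits(m)
    return s0, x ^ s0


def unshare(shares: tuple[int, int]) -> int:
    """Recombine two shares."""
    return shares[0] ^ shares[1]


def masked_mul(a_sh, b_sh, m: int, poly: int) -> tuple[int, int]:
    """Two-share masked product: output shares XOR to a*b in GF(2^m)."""
    a0, a1 = a_sh
    b0, b1 = b_sh
    r = secrets.randbits(m)  # fresh randomness for every multiplication
    c0 = gf_mul(a0, b0, m, poly) ^ r
    # The cross terms are folded into r first so that no intermediate
    # value equals an unmasked combination of the shares.
    c1 = gf_mul(a1, b1, m, poly) ^ (
        (r ^ gf_mul(a0, b1, m, poly)) ^ gf_mul(a1, b0, m, poly)
    )
    return c0, c1


if __name__ == "__main__":
    m, poly = 8, 0x11B  # small example field, chosen only for illustration
    c = masked_mul(share(0x57, m), share(0x83, m), m, poly)
    print(hex(unshare(c)))  # 0xc1
```

The sketch replaces one multiplication with four share-wise multiplications plus fresh randomness, which gives a sense of the overhead that any masking scheme must keep small on constrained nodes.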

For practical deployment, we envision developing tiered security implementations suitable for different application scenarios in future work. Basic protection will leverage the architecture’s inherent advantages for resource-constrained nodes, while enhanced versions will incorporate masking and detection mechanisms for general IoT devices. High-security implementations would integrate comprehensive countermeasures for critical infrastructure applications. This flexible approach will allow appropriate security levels to be matched with specific deployment requirements while maintaining the $O(q)$ complexity scaling.

Ongoing and future research will focus on implementing and validating these security measures through rigorous side-channel evaluation and real-world testing. Particular attention will be given to optimizing the balance between protection strength and performance overhead, including the development of dynamic reconfiguration mechanisms to break attack patterns. The systolic structure’s regularity provides a strong foundation for these security enhancements, transforming what might initially appear as a vulnerability into a systematic advantage for implementing robust cryptographic protections. Future work will demonstrate practical implementations of these security measures across various smart city IoT deployment scenarios, building upon the computational efficiency established in the current design.

8. Key Findings and Conclusions

This study builds upon the authors’ prior research by developing an innovative, high-throughput computational systolic array for Dickson basis multiplication in the binary extension field. Through the implementation of a distinctive systolic array design, refined processing element architectures, and optimized signal routing schemes, this work significantly enhances performance compared to earlier designs. The adopted approach involves the generation of a data dependency graph, which delineates the interdependencies inherent within the chosen multiplication algorithm. Subsequently, the application of optimized temporal scheduling and projection mapping functions to the nodes of the dependency graph facilitates the realization of a bit-parallel systolic multiplier. This architecture affords efficient and rapid multiplication operations utilizing the Dickson basis. A primary advantage of this systolic configuration is its significantly reduced spatial complexity. In contrast to prior parallel implementations that exhibit quadratic spatial scaling, the proposed architecture achieves linear spatial scaling, thereby representing a substantial improvement in resource efficiency, particularly pertinent to VLSI implementations. The complexity analysis confirms a marked reduction in the physical area occupied by the multiplier, further validating its efficiency and suitability for hardware realization. To thoroughly assess the performance of the proposed architecture, both the new design and previously established multiplier configurations were synthesized using the ASIC Nangate standard cell library. The synthesis outcomes indicated considerable decreases in area and power usage, highlighting the efficiency of the architecture. Furthermore, performance indicators, such as the power-delay product and area-delay product, demonstrated significant improvements, thereby confirming the overall effectiveness of the proposed system in achieving its design goals. 
Consequently, the results reinforce the conclusion that this multiplier design is particularly advantageous for implementing cryptographic algorithms in resource-limited IoT edge devices within smart urban environments. The proposed systolic architecture demonstrates inherent resistance to timing attacks while offering structural advantages for security enhancements. Future refinements will incorporate error detection mechanisms and masking techniques to address electromagnetic/power leakage vulnerabilities, followed by integration into ECC implementations for comprehensive security evaluation in smart city IoT applications.

Author Contributions

Conceptualization, A.I.; methodology, A.I. and F.G.; software, A.I.; validation, A.I.; formal analysis, A.I.; investigation, A.I.; resources, A.I.; data curation, A.I.; writing—original draft preparation, A.I.; writing—review and editing, A.I. and F.G.; visualization, A.I.; supervision, A.I.; project administration, A.I.; funding acquisition, A.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Prince Sattam bin Abdulaziz University, project number (PSAU/2024/01/31440).

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors extend their appreciation to Prince Sattam bin Abdulaziz University for funding this research work through project number (PSAU/2024/01/31440).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

IoT	Internet of things
ADP	Area-delay product
PDP	Power-delay product
ASIC	Application-specific integrated circuit
ECC	Elliptic curve cryptography
DG	Dependency graph
CPD	Critical path delay

References

1. Rao, P.M.; Pedada, S.; Jangirala, S.; Das, A.K.; Rodrigues, J.J. Role of IoT in the ages of digital to smart cities: Security challenges and countermeasures. IEEE Internet Things Mag. 2024, 7, 56–64.
2. Nassereddine, M.; Alex, K. Applications of Internet of Things (IoT) in smart cities. In Advanced IoT Technologies and Applications in the Industry 4.0 Digital Economy; CRC Press: Boca Raton, FL, USA, 2024; pp. 109–136.
3. Vempati, S.; Nalini, N. Securing Smart Cities: A Cybersecurity Perspective on Integrating IoT, AI, and Machine Learning for Digital Twin Creation. J. Electr. Syst. 2024, 20, 1420–1429.
4. Chen, C.C.; Lee, C.Y.; Lu, E.H. Scalable and Systolic Montgomery Multipliers Over GF(2^m). IEICE Trans. Fundam. 2008, E91-A, 1763–1771.
5. Chiou, C.W.; Lee, C.Y.; Deng, A.W.; Lin, J.M. Concurrent error detection in Montgomery multiplication over GF(2^m). IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2006, E89-A, 566–574.
6. Huang, W.T.; Chang, C.; Chiou, C.; Chou, F. Concurrent error detection and correction in a polynomial basis multiplier over GF(2^m). IET Inf. Secur. 2010, 4, 111–124.
7. Kim, K.W.; Jeon, J.C. Polynomial Basis Multiplier Using Cellular Systolic Architecture. IETE J. Res. 2014, 60, 194–199.
8. Choi, S.; Lee, K. Efficient systolic modular multiplier/squarer for fast exponentiation over GF(2^m). IEICE Electron. Express 2015, 12, 1–6.
9. Reyhani-Masoleh, A. A new bit-serial architecture for field multiplication using polynomial bases. In Proceedings of the 7th International Workshop Cryptographic Hardware Embedded Systems (CHES 2008), Washington, DC, USA, 10–13 August 2008; pp. 300–314.
10. Abdulrahman, E.A.H.; Reyhani-Masoleh, A. High-Speed Hybrid-Double Multiplication Architectures Using New Serial-Out Bit-Level Mastrovito Multipliers. IEEE Trans. Comput. 2016, 65, 1734–1747.
11. Kim, K.W.; Jeon, J.C. A semi-systolic Montgomery multiplier over GF(2^m). IEICE Electron. Express 2015, 12, 1–6.
12. Ibrahim, A. Novel Bit-Serial Semi-Systolic Array Structure for Simultaneously Computing Field Multiplication and Squaring. IEICE Electron. Express 2019, 16, 20190600.
13. Kim, K.W.; Lee, J.D. Efficient unified semi-systolic arrays for multiplication and squaring over GF(2^m). IEICE Electron. Express 2017, 14, 1–10.
14. Kim, K.W.; Kim, S.H. Efficient bit-parallel systolic architecture for multiplication and squaring over GF(2^m). IEICE Electron. Express 2018, 15, 1–6.
15. Ibrahim, A. Efficient Parallel and Serial Systolic Structures for Multiplication and Squaring Over GF(2^m). Can. J. Electr. Comput. Eng. 2019, 42, 114–120.
16. Ibrahim, A.; Gebali, F. Enhancing Security and Efficiency in IoT Assistive Technologies: A Novel Hybrid Systolic Array Multiplier for Cryptographic Algorithms. Appl. Sci. 2025, 15, 2660.
17. Ibrahim, A.; Gebali, F. Optimizing Security of Radio Frequency Identification Systems in Assistive Devices: A Novel Unidirectional Systolic Design for Dickson-Based Field Multiplier. Systems 2025, 13, 154.
18. Hasan, A.; Negre, C. Low space complexity multiplication over binary fields with Dickson polynomial representation. IEEE Trans. Comput. 2010, 60, 602–607.
19. Chiou, C.W.; Lee, C.M.; Sun, Y.S.; Lee, C.Y.; Lin, J.M. High-throughput Dickson basis multiplier with a trinomial for lightweight cryptosystems. IET Comput. Digit. Tech. 2018, 12, 187–191.
20. Chiou, C.; Sun, Y.S.; Lee, C.M.; Liou, J.Y. Low-complexity unidirectional systolic Dickson basis multiplier for lightweight cryptosystems. Electron. Lett. 2019, 55, 28–30.
21. Pillutla, S.R.; Boppana, L. Area-efficient low-latency polynomial basis finite field GF(2^m) systolic multiplier for a class of trinomials. Microelectron. J. 2020, 97, 104709.
22. Imana, J.L. LFSR-Based Bit-Serial GF(2^m) Multipliers Using Irreducible Trinomials. IEEE Trans. Comput. 2020, 70, 156–162.
23. Pillutla, S.R.; Boppana, L. Low-latency area-efficient systolic bit-parallel GF(2^m) multiplier for a narrow class of trinomials. Microelectron. J. 2021, 117, 105275.
24. Li, Y.; Cui, X.; Zhang, Y. An Efficient CRT-based Bit-parallel Multiplier for Special Pentanomials. IEEE Trans. Comput. 2021, 71, 736–742.
25. Li, Y.; Zhang, Y.; He, W. Fast hybrid Karatsuba multiplier for type II pentanomials. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 2459–2463.
26. Meher, P.K.; Lou, X. Low-Latency, Low-Area, and Scalable Systolic-Like Modular Multipliers for GF(2^m) Based on Irreducible All-One Polynomials. IEEE Trans. Circuits Syst. I Regul. Pap. 2016, 64, 399–408.
27. Mohaghegh, S.; Yemiscoglu, G.; Muhtaroglu, A. Low-Power and Area-Efficient Finite Field Multiplier Architecture Based on Irreducible All-One Polynomials. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 12–14 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–5.
28. Zhang, Y.; Li, Y. Efficient Hybrid GF(2^m) Multiplier for All-One Polynomial Using Varied Karatsuba Algorithm. IEICE Trans. Fundam. Electron. Comput. Sci. 2021, 104, 636–639.
29. Zhou, B.B. A New Bit Serial Systolic Multiplier over GF(2^m). IEEE Trans. Comput. 1988, 37, 749–751.
30. Fenn, S.T.J.; Taylor, D.; Benaissa, M. A Dual Basis Bit Serial Systolic Multiplier for GF(2^m). Integr. VLSI J. 1995, 18, 139–149.
31. Lee, C.Y.; Lu, E.H.; Lee, J.Y. Bit-Parallel Systolic Multipliers for GF(2^m) Fields Defined by All-One and Equally-Spaced Polynomials. IEEE Trans. Comput. 2001, 50, 358–393.
32. Lee, C.Y.; Lu, E.H.; Sun, L.F. Low-Complexity Bit-Parallel Systolic Architecture for Computing AB²+C in a Class of Finite Field GF(2^m). IEEE Trans. Circuits Syst. II 2001, 50, 519–523.
33. Lee, C.Y.; Chiou, C.W. Efficient Design of Low-Complexity Bit-Parallel Systolic Hankel Multipliers to Implement Multiplication in Normal and Dual Bases of GF(2^m). IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2005, E88-A, 3169–3179.
34. Lee, C.Y. Low-latency bit-parallel systolic multiplier for irreducible x^m + x^n + 1 with GCD(m,n) = 1. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2008, 55, 828–837.
35. Bayat-Sarmadi, S.; Farmani, M. High-Throughput Low-Complexity Systolic Montgomery Multiplication Over GF(2^m) Based on Trinomials. IEEE Trans. Circuits Syst. II 2015, 62, 377–381.
36. Mathe, S.E.; Boppana, L. Bit-parallel systolic multiplier over GF(2^m) for irreducible trinomials with ASIC and FPGA implementations. IET Circuits Devices Syst. 2018, 12, 315–325.
37. Lee, C.Y.; Chiou, C.W.; Lin, J.M. Concurrent error detection in a polynomial basis multiplier over GF(2^m). J. Electron. Test. 2006, 22, 143–150.
38. Lee, K. Resource and Delay Efficient Polynomial Multiplier over Finite Fields GF(2^m). J. Korea Soc. Digit. Ind. Inf. Manag. 2020, 16, 1–9.
39. Lee, K. Low Complexity Systolic Montgomery Multiplication over Finite Fields GF(2^m). J. Korea Soc. Digit. Ind. Inf. Manag. 2022, 18, 1–9.
40. Mathe, S.E.; Boppana, L. Design and Implementation of a Sequential Polynomial Basis Multiplier over GF(2^m). KSII Trans. Internet Inf. Syst. 2017, 11, 2680–2700.
41. Gebali, F. Algorithms and Parallel Computers; John Wiley: New York, NY, USA, 2011.
42. Chiou, C.W.; Lin, J.M.; Lee, C.Y.; Ma, C.T. Novel Mastrovito Multiplier over GF(2^m) Using Trinomial. In Proceedings of the 2011 5th International Conference on Genetic and Evolutionary Computing (ICGEC), Kitakyushu, Japan, 29 August–1 September 2011; pp. 237–242.
43. Ibrahim, A.; Gebali, F.; Bouteraa, Y.; Tariq, U.; Ahanger, T.; Alnowaiser, K. Compact Bit-Parallel Systolic Multiplier Over GF(2^m). IEEE Can. J. Electr. Comput. Eng. 2021, 44, 199–205.

Figure 1. DG representing the matrix-vector multiplication $N_{1}$ [17].

Figure 2. DG representing the matrix-vector multiplication $N_{2}$ [17].

Figure 3. DG representing the matrix-vector multiplication $N_{3}$ [17].

Figure 4. DG with assigned scheduling times and PE indices for each node.

Figure 5. Systolic bit-parallel multiplier structure.

Figure 6. Logic diagram of the solid PE ($PE_{t}$) of the systolic arrays.

Figure 7. Logic diagram of the shaded PEs ($PE_{b}$) of the systolic arrays.

Figure 8. Area results.

Figure 9. Power results.

Figure 10. Delay results.

Figure 11. Area-delay product (ADP) results.

Figure 12. Power-delay product (PDP) results.

Table 1. Comparison of area and time complexities of the proposed and competing multiplier designs.

Design	AND	XOR	MUX	Latch	Latency	CPD	Area Complexity	Time Complexity
Chiou-a [19]	$q^{2}$	$3q^{2} + 2q$	0	$3q^{2} + 4q$	$q + 1$	$\Delta_{A} + 3\Delta_{X}$	$O(q^{2})$	$O(q)$
Chiou-b [20]	$q^{2}$	$q^{2} + q$	0	$3q^{2}$	$q + 2$	$\Delta_{A} + \Delta_{X}$	$O(q^{2})$	$O(q)$
Lee-a [38]	$q^{2} + q$	$q^{2} + 2q$	0	$1.6q^{2} + 4q$	$(q + 7)/2$	$\Delta_{A} + \Delta_{X}$	$O(q^{2})$	$O(q)$
Lee-b [39]	$q^{2} + q$	$q^{2} + (7q + 1)/2$	0	$2.1q^{2} + 6.5q$	$(q + 7)/2$	$\Delta_{A} + \Delta_{X}$	$O(q^{2})$	$O(q)$
Chiou-c [42]	$q^{2}$	$q^{2} + q$	$q$	$2q^{2} + 3q$	$q + 1$	$\Delta_{A} + \Delta_{X} + \Delta_{M}$	$O(q^{2})$	$O(q)$
Proposed	$3q$	$5q$	0	$3q$	$q$	$2\Delta_{X}$	$O(q)$	$O(q)$
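To make the gap between the $O(q^{2})$ and $O(q)$ area rows concrete, the expressions in Table 1 can be evaluated for a specific field size. The following Python sketch is illustrative only: the dictionary `TABLE1` and the helper `evaluate` are hypothetical names introduced here, and the expressions are transcribed from the table rather than taken from the authors' tooling.

```python
# Gate-count and latency expressions transcribed from Table 1, as functions of q.
# TABLE1 and evaluate() are illustrative names, not part of the original work.
TABLE1 = {
    "Chiou-a [19]": {"AND": lambda q: q**2, "XOR": lambda q: 3*q**2 + 2*q,
                     "MUX": lambda q: 0, "Latch": lambda q: 3*q**2 + 4*q,
                     "Latency": lambda q: q + 1},
    "Chiou-b [20]": {"AND": lambda q: q**2, "XOR": lambda q: q**2 + q,
                     "MUX": lambda q: 0, "Latch": lambda q: 3*q**2,
                     "Latency": lambda q: q + 2},
    "Lee-a [38]": {"AND": lambda q: q**2 + q, "XOR": lambda q: q**2 + 2*q,
                   "MUX": lambda q: 0, "Latch": lambda q: 1.6*q**2 + 4*q,
                   "Latency": lambda q: (q + 7) / 2},
    "Lee-b [39]": {"AND": lambda q: q**2 + q, "XOR": lambda q: q**2 + (7*q + 1) / 2,
                   "MUX": lambda q: 0, "Latch": lambda q: 2.1*q**2 + 6.5*q,
                   "Latency": lambda q: (q + 7) / 2},
    "Chiou-c [42]": {"AND": lambda q: q**2, "XOR": lambda q: q**2 + q,
                     "MUX": lambda q: q, "Latch": lambda q: 2*q**2 + 3*q,
                     "Latency": lambda q: q + 1},
    "Proposed": {"AND": lambda q: 3*q, "XOR": lambda q: 5*q,
                 "MUX": lambda q: 0, "Latch": lambda q: 3*q,
                 "Latency": lambda q: q},
}


def evaluate(q):
    """Return the numeric value of every Table 1 expression for a given q."""
    return {design: {col: expr(q) for col, expr in cols.items()}
            for design, cols in TABLE1.items()}


if __name__ == "__main__":
    # q = 163 is the field size used for the implementation results.
    for design, cols in evaluate(163).items():
        print(design, cols)
```

For $q = 163$, the proposed row evaluates to 489 AND gates, 815 XOR gates, and 489 latches, whereas the Chiou-a row already requires 26,569 AND gates. The latencies obtained this way (164, 165, 85, 85, 164, and 163 cycles) agree with the Latency column of Table 2.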

Table 2. Performance evaluation of different multipliers for $q = 163$.

Multiplier	q	Latency	A (Kgates)	D (ns)	P (mW)	ADP	PDP (E)	E/bit	A Saving (%)	P Saving (%)	ADP Saving (%)	PDP Saving (%)
Chiou-a [19]	163	164	3503.3	9.0	116.5	31,554.1	1049.2	6.44	99.8	96.9	99.9	98.8
Chiou-b [20]	163	165	2463.0	5.6	97.7	13,840.5	549.1	3.37	99.7	96.2	99.8	97.7
Lee-a [38]	163	85	1515.5	2.7	63.0	4167.4	173.4	1.06	99.6	94.2	99.5	92.8
Lee-b [39]	163	85	2172.1	2.7	80.9	5973.1	222.4	1.36	99.7	95.5	99.6	94.3
Chiou-c [42]	163	164	1885.5	7.4	75.2	13,901.5	554.5	3.40	99.7	95.1	99.8	97.7
Proposed	163	163	6.2	3.4	3.7	21.2	12.6	0.08	-	-	-	-
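The derived columns of Table 2 can be approximately reproduced from the area, delay, and power columns. The Python sketch below assumes, based on the column names, that ADP = A × D, PDP = P × D, and E/bit = PDP / q, and that each saving is computed as (competitor − proposed) / competitor × 100. These relations are inferred here rather than stated in the table, and the names `TABLE2`, `derived_metrics`, `saving`, and `summarize` are hypothetical. Because the inputs are the rounded table entries, the results match the published figures only to within rounding.

```python
# Rounded (A in Kgates, D in ns, P in mW) entries copied from Table 2.
TABLE2 = {
    "Chiou-a [19]": (3503.3, 9.0, 116.5),
    "Chiou-b [20]": (2463.0, 5.6, 97.7),
    "Lee-a [38]": (1515.5, 2.7, 63.0),
    "Lee-b [39]": (2172.1, 2.7, 80.9),
    "Chiou-c [42]": (1885.5, 7.4, 75.2),
    "Proposed": (6.2, 3.4, 3.7),
}
Q = 163  # field size used in Table 2


def derived_metrics(area, delay, power, q=Q):
    """Assumed definitions: ADP = A*D, PDP = P*D, E/bit = PDP/q."""
    adp = area * delay
    pdp = power * delay
    return {"ADP": adp, "PDP": pdp, "E/bit": pdp / q}


def saving(reference, proposed):
    """Assumed definition: percentage reduction relative to the reference design."""
    return 100.0 * (reference - proposed) / reference


def summarize(table=TABLE2, q=Q):
    """Savings of the proposed design over every other design in the table."""
    prop_area, _, prop_power = table["Proposed"]
    prop = derived_metrics(*table["Proposed"], q=q)
    out = {}
    for name, (area, delay, power) in table.items():
        if name == "Proposed":
            continue
        m = derived_metrics(area, delay, power, q=q)
        out[name] = {
            "A saving": saving(area, prop_area),
            "P saving": saving(power, prop_power),
            "ADP saving": saving(m["ADP"], prop["ADP"]),
            "PDP saving": saving(m["PDP"], prop["PDP"]),
        }
    return out


if __name__ == "__main__":
    for name, row in summarize().items():
        print(name, {k: round(v, 1) for k, v in row.items()})
```

Under these assumptions, the proposed design's ADP and PDP evaluate to about 21.1 and 12.6, close to the tabulated 21.2 and 12.6; the small differences are consistent with the table having been computed from unrounded synthesis results.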

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

