Enhancing Cryptographic Solutions for Resource-Constrained RFID Assistive Devices: Implementing a Resource-Efficient Field Montgomery Multiplier

Ibrahim, Atef; Gebali, Fayez

doi:10.3390/computers14040135

by Atef Ibrahim ^1,2,* and Fayez Gebali ^3

^1 Computer Engineering Department, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 16278, Saudi Arabia

^2 King Salman Center for Disability Research, Riyadh 11614, Saudi Arabia

^3 Electrical and Computer Engineering Department, University of Victoria, Victoria, BC V8P 5C2, Canada

^* Author to whom correspondence should be addressed.

Computers 2025, 14(4), 135; https://doi.org/10.3390/computers14040135

Submission received: 24 February 2025 / Revised: 17 March 2025 / Accepted: 24 March 2025 / Published: 6 April 2025

(This article belongs to the Special Issue Wearable Computing and Activity Recognition)

Abstract

Radio Frequency Identification (RFID) assistive systems, which integrate RFID devices with IoT technologies, are vital for enhancing the independence, mobility, and safety of individuals with disabilities. These systems enable applications such as RFID navigation for blind users and RFID-enabled canes that provide real-time location data. Central to these systems are resource-constrained RFID devices that rely on RFID tags to collect and transmit data, but their limited computational capabilities make them vulnerable to cyberattacks, jeopardizing user safety and privacy. Implementing the Elliptic Curve Cryptography (ECC) algorithm is essential to mitigate these risks; however, its high computational complexity exceeds the capabilities of these devices. The fundamental operation of ECC is finite field multiplication, which is crucial for securing data. Optimizing this operation allows ECC computations to be executed without overloading the devices’ limited resources. Traditional multiplication designs are often unsuitable for such devices due to their excessive area and energy requirements. Therefore, this work tackles these challenges by proposing an efficient and compact field multiplier design optimized for the Montgomery multiplication algorithm, a widely used method in cryptographic applications. The proposed design significantly reduces both space and energy consumption while maintaining computational performance, making it well-suited for resource-constrained environments. ASIC synthesis results demonstrate substantial improvements in key metrics, including area, power consumption, Power-Delay Product (PDP), and Area-Delay Product (ADP), highlighting the multiplier’s efficiency and practicality. This innovation enables the implementation of ECC on RFID assistive devices, enhancing their security and reliability, thereby allowing individuals with disabilities to engage with assistive technologies more safely and confidently.

Keywords:

disabilities; applications in healthcare; medical sensors; security in wearable devices; RFID assistive technology; RFID sensor tags; cryptography; field Montgomery multiplier; cryptosystems; IoT security; ubiquitous computing

1. Introduction

The RFID assistive system represents a sophisticated integration of RFID technology and the IoT, aimed at enhancing the autonomy, safety, and quality of life for individuals with disabilities. For instance, an RFID navigation system is a key application that assists blind users in unfamiliar environments by providing vital orientation cues, thereby preventing accidents and promoting safe navigation [1,2]. These systems employ strategically placed RFID tags throughout public spaces, allowing visually impaired users to detect these tags using handheld or wearable RFID readers. This technology provides audible or tactile feedback, guiding users effectively while alerting them to potential hazards in real time. A notable innovation in this field is the RFID cane, which is equipped with a tag reader and antenna that emits radio waves to interact with nearby RFID tags, accurately pinpointing the user’s location [1]. This enhanced mobility aid not only helps users navigate their surroundings but also fosters a sense of security and confidence. Furthermore, the RFID cane can transmit information via Bluetooth or ZigBee, enabling users to save destination names as voice messages, thus streamlining their travel experience and enhancing their ability to explore new environments independently.

Wearable devices that monitor vital signs and transmit data to healthcare professionals play a crucial role in supporting individuals with disabilities by ensuring timely medical assistance tailored to their specific health needs. RFID technology can enhance these devices by enabling precise tracking of users’ health metrics in real time, allowing for immediate intervention when necessary and thereby enhancing the overall well-being of disabled individuals [3,4,5,6,7,8]. Similarly, smart home technologies significantly improve accessibility for disabled users by automating daily tasks and enabling control through voice commands or adaptive interfaces. Incorporating RFID systems into smart home environments can facilitate seamless interactions between users and devices, allowing for personalized settings and efficient management of their surroundings. These technologies empower individuals with disabilities to manage their environments more independently, facilitating activities that may have previously required assistance [1,9,10]. By leveraging RFID technology, these innovations not only empower individuals with disabilities to engage more fully in society by facilitating access to education, employment, and social activities but also streamline daily living, fostering a greater sense of autonomy and confidence that significantly enhances their overall quality of life.

The RFID assistive system operates through a cohesive three-layer architecture that includes the perception layer, the network layer, and the application layer, as displayed in Figure 1. Each layer plays a critical role in ensuring the system’s functionality, security, and adaptability to the unique needs of its users. At the perception layer, multiple slave RFID readers are strategically placed throughout various environments—such as homes, workplaces, and public spaces—to continuously scan for RFID tags worn by disabled people or attached to nearby objects. Each RFID tag emits a unique identifier when it comes within proximity of a reader, allowing for precise tracking of the disabled individual’s location and interactions with their surroundings. The master RFID reader collects data from all slave readers, aggregating this information to ensure comprehensive coverage and accurate monitoring of the individual’s presence and interactions.

The data are captured and transmitted to the network layer, which comprises a router, an IP network, and a gateway. The router connects the master RFID reader to the IP network, facilitating seamless communication between the various components of the system. This network enables the flow of data from the RFID readers to the operational support platform. The gateway plays a critical role in linking the IP network to this platform, ensuring that the data processed by the readers can be effectively accessed and utilized for further analysis. The incorporation of IoT technology allows for enhanced data transmission and remote monitoring, enabling continuous updates and real-time insights.

In the application layer, the data stored in the RFID database plays a vital role in the overall functionality of the system. This database not only stores information about the RFID tags and their associated disabled individuals but also facilitates data retrieval and analysis to generate actionable insights. The data are transformed into relevant information through a monitoring application server, which hosts applications designed to provide tailored assistance based on the unique needs of each disabled individual. For example, the system can provide auditory alerts to people with visual impairments when they approach obstacles or important objects, helping them navigate their environment safely. For those with hearing impairments, visual indicators or vibrations can signal important notifications or alerts in their surroundings. Disabled individuals with mobility impairments can benefit from the system’s ability to automatically adjust the environment, such as opening doors or rearranging furniture, to facilitate easier access and movement. Additionally, the system can assist individuals with cognitive impairments by offering reminders or prompts through various devices, such as smartphones or tablets, to help with daily tasks and routines.

The RFID server serves as a critical component in the application layer, managing the communication between the RFID readers and the operational support platform. It processes incoming data from the readers, ensuring efficient data handling and storage while facilitating the integration of various system components, thereby enhancing overall system performance and reliability.

The monitoring application server connects not only to disabled individuals but also to doctors, nurses, and caregivers, allowing them to stay informed about those they support. These healthcare professionals can access real-time data, receive alerts about unusual activities, and monitor the health and safety of their patients through a user-friendly interface on smartphones or tablets. For instance, if a visually impaired individual approaches a specific object, the system can generate auditory alerts or provide contextual information about that object.

Through its connection with IoT, the RFID assistive system can also connect to other smart devices within the environment. This allows for automated responses, such as adjusting lighting or providing notifications to caregivers when specific conditions are met. Overall, the RFID assistive system represents a comprehensive solution designed to enhance the safety and autonomy of individuals with disabilities. Additionally, it fosters a supportive environment by ensuring that caregivers and healthcare professionals remain informed and engaged in the care process. Through the seamless integration of data collection, processing, and application, the system guarantees that disabled individuals receive timely assistance tailored to their specific needs. This approach not only promotes their independence but also significantly improves their overall quality of life.

The RFID assistive system, while offering significant benefits to individuals with disabilities, presents several privacy and security concerns at each layer of its architecture [11,12]. Understanding these vulnerabilities is critical to ensuring user safety and maintaining trust in the technology. At the perception layer, eavesdropping emerges as a prominent threat. Attackers can intercept communication between RFID tags and readers, thereby exposing sensitive information regarding an individual’s location and activities. For instance, a visually impaired user utilizing an RFID navigation system may be subjected to tracking by malicious actors, leading to potential harassment or exploitation. The ramifications of such breaches are severe, as they undermine users’ sense of safety and independence.

Relay attacks represent another significant issue within this layer. In this scenario, an attacker employs two devices to extend the communication range between an RFID tag and a reader, misleading the system into believing that the user is situated in a different location. This deception can enable unauthorized access to secure areas or generate false alarms, resulting in confusion and potential harm, particularly for users with cognitive impairments who may struggle to process unexpected situations. For individuals with mobility impairments, being inadvertently directed into restricted areas can result in hazardous situations, especially in unfamiliar environments.

Denial of Service (DoS) attacks can significantly disrupt the functionality of the RFID assistive system. By overwhelming communication channels, attackers can render the system inoperative, depriving users of the critical guidance they rely upon for navigation. For example, if a visually impaired individual’s navigation system fails due to a DoS attack, they may become disoriented and susceptible to accidents. This scenario underscores their reliance on technology for safety, highlighting how disruptions can significantly impact their ability to navigate effectively and maintain independence.

Spoofing represents another significant concern at the perception layer, wherein attackers impersonate legitimate RFID tags. This fraudulent activity can cause the system to provide inaccurate information, thereby exposing individuals with cognitive impairments to misleading alerts or directions. Such misrepresentation may place these individuals in precarious situations, undermining their safety and autonomy.

While the perception layer faces substantial risks, the network layer is also susceptible to security threats that can compromise the integrity and confidentiality of user data. Man-in-the-Middle (MitM) attacks can intercept and alter communications between RFID readers and the gateway, jeopardizing the accuracy of the information transmitted. For instance, if health data from a wearable device is tampered with during transmission, it could lead to inappropriate medical interventions, thereby endangering the health of individuals with chronic conditions. Furthermore, data tampering within the network layer can compromise the reliability of the system. Unauthorized modifications to transmitted data could result in erroneous system responses, such as failing to alert caregivers during emergencies. For disabled individuals, especially those with mobility impairments, such failures can have severe consequences, including delayed assistance in critical situations.

At the application layer, security vulnerabilities can lead to severe privacy violations. Data breaches may expose sensitive information about disabled individuals, encompassing their health data and daily routines. Such breaches not only compromise privacy but can also result in discrimination or exploitation. For caregivers and healthcare providers, unauthorized access to the RFID database can lead to misinformation, resulting in inappropriate care or neglect. Additionally, malware attacks targeting the application layer can disrupt the monitoring application server, impairing its functionality or facilitating the theft of sensitive data. This disruption can prevent caregivers from receiving critical alerts or accessing real-time information about their patients, further jeopardizing the health and safety of disabled individuals who rely on timely interventions.

While security issues at the network and application layers are crucial, this study refrains from addressing solutions for these areas, as the existing literature has already provided a thorough exploration of various security measures and a comprehensive understanding of the related threats and mitigations. Focusing on the perception layer is especially urgent due to its direct impact on user safety and the immediate threats faced by individuals with disabilities. By prioritizing this layer, we aim to improve the security and effectiveness of assistive technologies that are essential for their safety and independence.

To mitigate cyberattacks targeting the RFID assistive system at the perception layer, significant efforts have been made to develop tailored security solutions. Advanced authentication protocols, particularly those employing ECC, offer enhanced security by leveraging strong cryptographic properties. However, implementing ECC on low-cost RFID tags is challenging due to the associated high computational demands, which often exceed the limited processing power, memory, and battery life of these devices [8,13]. In response, lightweight protocols utilizing simpler encryption methods have been proposed, but these solutions often exhibit vulnerabilities, such as susceptibility to impersonation and eavesdropping [14,15,16,17]. Recent innovations, including group authentication protocols compliant with established standards and schemes based on permutation matrix encryption, aim to balance security and efficiency for low-cost RFID systems [18]. Despite these advancements, many solutions assume secure communication between servers and readers, which may not be feasible in mobile RFID contexts where communication channels are inherently less secure.

We have chosen to employ ECC in our RFID assistive system rather than adopting other recommended lightweight cryptographic standards such as ASCON and GIFT-COFB due to several compelling advantages [19,20]. ECC is recognized for its efficiency as a public-key cryptographic method that leverages the mathematical properties of elliptic curves over finite fields, providing robust security with smaller key sizes (typically 256 bits) that are equivalent to much larger keys in traditional algorithms like RSA. This not only enhances security but also reduces computational load, making ECC particularly suitable for resource-constrained environments where efficiency and power conservation are essential. Furthermore, ECC supports a comprehensive range of functionalities, including encryption, user and message authentication, digital signatures, and key distribution, which are critical for establishing trust and verifying identities in secure communications. In contrast, ASCON and GIFT-COFB are primarily symmetric algorithms focused solely on encryption and message authentication, lacking the versatility and broader application scope that ECC offers. Table 1 presents a comparison of these algorithms with ECC; ASCON is an effective authenticated encryption algorithm that utilizes a permutation-based structure, while GIFT-COFB relies on a lightweight block cipher. While symmetric algorithms may provide speed advantages, they do not support the public-key functionalities essential for digital signatures and secure key distribution. Moreover, every algorithm carries its own potential vulnerabilities: ECC may be vulnerable to side-channel attacks if not implemented properly, ASCON may be susceptible to collision attacks owing to its hashing foundation, and GIFT-COFB may be exposed to certain cryptanalysis techniques because its security rests on a lightweight block cipher. Overall, the choice of ECC is driven by its superior efficiency, robust security, and ability to deliver a comprehensive suite of cryptographic functionalities essential for our RFID assistive system.

Implementing ECC-based authentication can significantly enhance the security of RFID assistive systems. However, deploying ECC on resource-constrained RFID tags presents several challenges. These tags typically have limited processing power, memory, and battery life, making it difficult to execute complex cryptographic operations efficiently. The computational overhead associated with ECC may lead to slow response times, increased energy consumption, and potential system failures, all of which are critical concerns for devices designed to support individuals with disabilities.

To address these challenges, a specialized hardware solution tailored for ECC can be developed. This solution would optimize the hardware architecture to accommodate the unique constraints of RFID tags while maintaining the integrity of cryptographic operations. The ECC algorithm primarily involves finite field arithmetic, with finite field multiplication serving as the foundational operation for various cryptographic functions, including field inversion and field division [21,22,23,24,25,26,27,28,29,30,31,32]. Therefore, finite field multiplication is a critical component of ECC, and its efficient implementation is essential for the overall performance and feasibility of cryptographic solutions on resource-constrained devices. Successfully implementing this operation on RFID tags is essential for the effective deployment of robust ECC cryptographic solutions. By optimizing this operation, it becomes feasible to execute ECC computations without overloading the limited resources of these tags. Traditional multiplication designs are often unsuitable for such devices due to their excessive area and energy requirements, which can hinder their practicality in low-cost, resource-constrained environments.
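To make concrete how field inversion and division reduce to repeated field multiplication, the following minimal Python sketch may help. It is an illustrative software model only, not the hardware design discussed in this work. The function names (gf2m_mul, gf2m_inv, gf2m_div), the encoding of field elements as integer bit vectors, and the small example field GF(2^8) with polynomial x^8 + x^4 + x^3 + x + 1 are assumptions chosen for readability. Inversion here uses Fermat's little theorem, a^(-1) = a^(2^m - 2), which is one common option among several.

```python
def gf2m_mul(a: int, b: int, poly: int, m: int) -> int:
    """Multiply a and b in GF(2^m); bit i of an int is the coefficient of x^i.

    poly is the irreducible polynomial including its x^m term (assumed encoding).
    Both operands are assumed to be already reduced (less than 2**m).
    """
    result = 0
    for _ in range(m):
        if b & 1:            # add (XOR) the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by x
        if (a >> m) & 1:     # degree reached m: reduce modulo poly
            a ^= poly
    return result


def gf2m_inv(a: int, poly: int, m: int) -> int:
    """Invert a nonzero element as a^(2^m - 2), using only multiplications."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse in GF(2^m)")
    result, base, exponent = 1, a, (1 << m) - 2
    while exponent:          # square-and-multiply exponentiation
        if exponent & 1:
            result = gf2m_mul(result, base, poly, m)
        base = gf2m_mul(base, base, poly, m)
        exponent >>= 1
    return result


def gf2m_div(a: int, b: int, poly: int, m: int) -> int:
    """Divide a by b: one inversion followed by one multiplication."""
    return gf2m_mul(a, gf2m_inv(b, poly, m), poly, m)


# Example field (assumption for illustration): GF(2^8), x^8 + x^4 + x^3 + x + 1.
AES_POLY, AES_M = 0x11B, 8
product = gf2m_mul(0x57, 0x83, AES_POLY, AES_M)   # 0xC1
inverse = gf2m_inv(0x53, AES_POLY, AES_M)         # 0xCA
```

In this sketch a single inversion costs on the order of 2m multiplications, which is why the area and energy of the multiplier dominate the cost of the higher-level field operations.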

This work tackles these challenges by proposing an efficient and compact one-dimensional bit-parallel semi-systolic field multiplier optimized for the Montgomery multiplication algorithm, a widely used method in cryptographic applications. The proposed design significantly reduces both space and energy consumption while maintaining computational performance, making it well-suited for resource-constrained environments. By minimizing the area and power requirements of finite field multiplication, this solution enables the practical implementation of ECC on RFID tags, enhancing the overall security of the system. This strategy not only improves the security of RFID assistive systems but also ensures that individuals with disabilities can access reliable and effective assistive technology.

Based on the background provided, the remainder of this study will primarily focus on the design methodology used for implementing an efficient and compact finite field multiplier that supports the computational demands of the ECC algorithm. This makes it suitable for integration into resource-constrained RFID tags designed for individuals with disabilities. The ASIC implementation results presented at the conclusion of this study demonstrate significant improvements in key metrics, including area, power consumption, Power-Delay Product (PDP), and Area-Delay Product (ADP), ensuring compliance with the stringent requirements of low-cost RFID assistive devices while delivering robust security. This innovation has the potential to greatly enhance the safety and autonomy of individuals with disabilities, allowing them to benefit from secure and reliable assistive technologies in their everyday lives.

2. Literature Review

The effectiveness of finite field multiplication depends critically on the choice of basis representation for the elements of GF(2^{m}). The advantages of the various basis representations, such as Polynomial Basis (PB), Normal Basis (NB), Dual Basis (DB), and Redundant Basis (RB), are well established [33]. Among these representations, polynomial basis arithmetic stands out as a simple, reliable, and scalable method, especially for hardware implementation [34,35,36,37,38]. Polynomial basis arithmetic is commonly used in cryptographic protocols because it does not require basis conversion, unlike other representations. The effectiveness of finite field multipliers is also greatly influenced by the irreducible polynomial chosen. All-Ones Polynomials (AOPs), trinomials, and pentanomials are a few examples of irreducible polynomials employed in cryptographic techniques. Although trinomial- and pentanomial-based multipliers are more efficient, generic polynomial-based multipliers remain useful for a variety of applications. Although irreducible AOPs are not used as often as irreducible trinomials or pentanomials, they can still yield efficient multipliers [39,40,41].
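The polynomial classes mentioned above can be illustrated with a short Python sketch. It encodes a polynomial over GF(2) as an integer bit mask, classifies it by its number of nonzero terms, and reduces a value modulo it. The helper names (poly_from_exponents, classify, reduce_mod) are hypothetical, and the example polynomials (the degree-233 trinomial and degree-163 pentanomial used by NIST binary curves, and the degree-10 AOP) are assumptions chosen for illustration, not polynomials taken from this work. The classifier does not test irreducibility.

```python
def poly_from_exponents(exponents):
    """Build a GF(2) polynomial bit mask from the exponents of its nonzero terms."""
    mask = 0
    for e in exponents:
        mask |= 1 << e
    return mask


def classify(poly: int) -> str:
    """Classify a polynomial by term count only (irreducibility is NOT checked)."""
    degree = poly.bit_length() - 1
    weight = bin(poly).count("1")
    if weight == degree + 1:
        return "all-ones polynomial"
    if weight == 3:
        return "trinomial"
    if weight == 5:
        return "pentanomial"
    return "general"


def reduce_mod(value: int, poly: int) -> int:
    """Reduce value modulo poly by cancelling the leading term repeatedly."""
    m = poly.bit_length() - 1
    while value.bit_length() > m:
        value ^= poly << (value.bit_length() - 1 - m)
    return value


# Example polynomials (assumptions for illustration).
TRINOMIAL_233 = poly_from_exponents([233, 74, 0])          # x^233 + x^74 + 1
PENTANOMIAL_163 = poly_from_exponents([163, 7, 6, 3, 0])   # x^163 + x^7 + x^6 + x^3 + 1
AOP_10 = poly_from_exponents(range(11))                    # x^10 + x^9 + ... + x + 1

# With a trinomial, x^233 folds back onto just two low-order terms.
folded = reduce_mod(1 << 233, TRINOMIAL_233)               # x^74 + 1
```

The sparse feedback visible in the trinomial case is what makes low-weight polynomials attractive in hardware: each reduction step touches only a few bit positions. For an AOP of degree m, x^(m+1) reduces to 1, which gives a similarly regular structure.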

Different methods of implementation can lead to the creation of diverse multipliers with varying characteristics. Bit-serial multipliers are known for their space efficiency and significant power savings, albeit at the expense of slower operation, requiring m clock cycles to multiply two elements [22,42,43]. On the other hand, bit-parallel multipliers provide the advantage of producing results in a single clock cycle but come with higher hardware costs and power consumption [24,25,28,44,45,46,47,48,49]. In the framework of Very Large Scale Integration (VLSI) implementations, systolic/semi-systolic serial or concurrent multiplier topologies are chosen over traditional approaches. This is because they possess qualities such as regularity, modularity, local and relatively homogeneous connectivity, and concurrency, which make them ideal for VLSI designs. Furthermore, systolic/semi-systolic arrays have underlying pipeline properties that allow for high clock frequencies despite substantial resource use.
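The trade-off between the two styles can be sketched with a simple behavioral model in Python. The bit-serial function consumes one bit of an operand per simulated clock cycle, while the bit-parallel function forms all m × m partial-product bits at once. The function names, the MSB-first ordering of the serial version, the AND-gate tally, and the example field GF(2^8) are assumptions made for illustration; the tally is a rough proxy for area, not a gate-accurate estimate of any cited design.

```python
def bit_serial_multiply(a: int, b: int, poly: int, m: int):
    """MSB-first bit-serial model: one bit of b is processed per clock cycle."""
    acc, cycles = 0, 0
    for i in reversed(range(m)):
        acc <<= 1                    # multiply the accumulator by x
        if (acc >> m) & 1:
            acc ^= poly              # reduce modulo the field polynomial
        if (b >> i) & 1:
            acc ^= a                 # conditionally add the multiplicand
        cycles += 1
    return acc, cycles


def bit_parallel_multiply(a: int, b: int, poly: int, m: int):
    """Bit-parallel model: every partial-product bit exists simultaneously."""
    rows, row = [], a
    for _ in range(m):               # rows[i] = a * x^i mod poly (fixed wiring)
        rows.append(row)
        row <<= 1
        if (row >> m) & 1:
            row ^= poly
    product, and_gates = 0, 0
    for i in range(m):
        for j in range(m):
            and_gates += 1           # one AND per (operand bit, row bit) pair
            product ^= (((rows[i] >> j) & 1) & ((b >> i) & 1)) << j
    return product, 1, and_gates     # result, cycles, AND-gate tally


# Example field (assumption): GF(2^8) with x^8 + x^4 + x^3 + x + 1.
serial_result, serial_cycles = bit_serial_multiply(0x57, 0x83, 0x11B, 8)
parallel_result, parallel_cycles, parallel_ands = bit_parallel_multiply(0x57, 0x83, 0x11B, 8)
```

Both models return the same product; the serial one reports m cycles with a datapath that is m bits wide, whereas the parallel one reports a single cycle at the price of m^2 partial-product terms.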

Numerous researchers have made significant efforts to devise effective implementations of systolic/semi-systolic multipliers for binary extension fields GF(2^{m}). These studies primarily focus on constructing multiplier architectures using specific irreducible polynomials. For instance, Lee and Chiou introduced in their work an error-detecting semi-systolic array multiplier [22,50]. Huang proposed an efficient semi-systolic array multiplier aimed at reducing both time and space costs [23]. In their work, Choi and Lee addressed the need for a highly efficient systolic array architecture that performs unified multiplication and squaring operations with minimal hardware overhead [25]. They developed a serial and parallel systolic array that enables rapid modular exponentiation by concurrently executing multiplication and squaring operations. This architecture is designed to optimize performance while minimizing resource utilization. Moreover, the incorporation of LSB-first multiplication and exponentiation algorithms further enhances the efficiency of the systolic array, allowing for faster computation of modular exponentiation.

In a separate study, Chiou proposed a semi-systolic array multiplier that offers reduced time complexity [51]. By carefully designing the multiplier structure, they achieved significant improvements in computational efficiency. This reduction in time complexity is crucial for applications that require fast and efficient multiplication, such as cryptographic algorithms. Recent research by Lee introduced novel semi-systolic Montgomery modular multipliers that leverage two levels of systolic computation [52,53]. These multipliers demonstrate efficient area utilization and reduced delay, which are essential factors in VLSI implementations. The utilization of systolic computation allows for parallel processing and pipelining, enabling faster and more efficient modular multiplication operations.

Mathe and Boppana proposed a multiplier architecture that supports both parallel and serial inputs [54]. This architecture provides flexibility in handling different types of operands and allows for efficient multiplication and squaring operations. By accommodating various input configurations, the multiplier can adapt to different application requirements and optimize performance accordingly. Additionally, Ibrahim introduced efficient one-dimensional bit-serial and bit-parallel systolic array structures for multiplication and squaring operations over GF(2^{m}) [32]. These structures are specifically designed for computations within the finite field GF(2^{m}) and offer efficient utilization of hardware resources. The bit-serial and bit-parallel systolic arrays enable optimized processing of binary data, making them suitable for applications such as error correction codes and cryptography.

In recent research by Pillutla and Boppana, a novel GF(2^{m}) polynomial basis systolic multiplier was introduced, specifically targeting field sizes m = 233 and m = 409 [36]. The proposed architecture incorporates suggested trinomials to enhance its performance. While GF(2^{m})-based multipliers have been created for a variety of applications, they frequently encounter difficulties owing to their high hardware complexity and large delay durations, especially in security-related applications. Therefore, it is crucial to conduct further investigations to explore multiplication architectures that can provide efficient performance while minimizing space and time requirements.

In the realm of bit-parallel systolic multipliers, a multiplier structure was initially proposed by Lee that utilized equally spaced and AOP polynomials as the foundation for its design [44,45]. However, in 2005, Lee introduced a mapping approach aimed at reducing the complexity of the AOP-based bit-parallel systolic multiplier [46]. The foundation of the multiplier was changed from AOP polynomials to trinomials as part of this mapping method. The multiplier architecture’s complexity was significantly reduced by using trinomials as the foundation, resulting in enhanced efficiency and performance.

In an effort to decrease the complexity of the Montgomery-based bit-parallel multiplier, Lee proposed a novel approach that involved utilizing the Toeplitz matrix-vector representation [47]. By employing this technique, the complexity associated with the multiplier was effectively reduced, resulting in a more efficient and practical design. The use of Toeplitz matrices in the representation of the multiplier allowed for efficient and streamlined computations, improving overall performance.

Sarmadi introduced a two-dimensional parallel systolic multiplier based on the Montgomery algorithm [48]. This innovative multiplier structure offered high performance capabilities while also minimizing space requirements. The parallel systolic design allowed for simultaneous and independent processing of multiple operations, leading to improved throughput and reduced computation time. By optimizing the space utilization, Sarmadi’s approach provided an efficient solution for applications that demand high-performance multiplication operations. Building upon these advancements, Mathe implemented a two-dimensional parallel systolic multiplier structure designed to minimize space overhead. The structure was based on an interleaving multiplication method over GF(2^{m}) [49]. By utilizing interleaving techniques, Mathe achieved an efficient utilization of hardware resources while maintaining high-performance multiplication operations. The resulting multiplier architecture offered a balanced trade-off between space requirements and computational efficiency.

2.1. Paper Contribution

This work focuses on the development of a one-dimensional bit-parallel semi-systolic implementation of the field Montgomery multiplication algorithm, as suggested by Lee et al. [53]. The algorithm efficiently handles multiplication operations over GF(2^{m}) by utilizing a general irreducible polynomial as its foundation. One notable advantage of the adopted Montgomery algorithm, in contrast to many other algorithms, is its ability to reduce latency. Furthermore, it offers the additional benefit of minimizing time and area overhead by employing the same architecture for executing both iterative parts of the algorithm [53]. However, previous works in the literature often rely on ad hoc approaches when extracting the hardware structure, without considering how the structure could be modified to optimize system performance factors such as latency, throughput, power, and area. Unlike ad hoc approaches, the proposed method takes into account the selection of appropriate scheduling and projection functions in order to extract an optimal architecture that is specifically tailored to meet the requirements of the target application. This mathematical approach offers several advantages, including the ability to systematically analyze the multiplier structure and optimize its performance characteristics. By adopting this mathematical approach, it becomes possible to enhance the overall efficiency of the systolic/semi-systolic implementation. The proposed methodology facilitates a more systematic and comprehensive exploration of the design space, enabling the identification of the most suitable architectural configuration for achieving optimal performance metrics. This, in turn, contributes to the overall improvement of system performance, including reduced latency, increased throughput, optimized power consumption, and minimized area overhead.
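For readers unfamiliar with Montgomery multiplication over GF(2^{m}), the following Python sketch shows the classical bit-level form, which computes A·B·x^(-m) mod G for an irreducible polynomial G of degree m. This is the textbook formulation and is offered only as background; the variant of Lee et al. [53] adopted in this work is organized into two iterative parts and may use a different Montgomery factor. The function names and the example field GF(2^8) are assumptions for illustration.

```python
def mont_mul(a: int, b: int, poly: int, m: int) -> int:
    """Classical bit-level Montgomery product a * b * x^(-m) mod poly in GF(2^m).

    poly must include its x^m term and have constant term 1 (true for any
    irreducible polynomial of degree >= 2), so adding it clears bit 0 of c.
    """
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b           # accumulate a_i * B
        if c & 1:
            c ^= poly        # make c divisible by x
        c >>= 1              # exact division by x
    return c


def poly_mod(value: int, poly: int) -> int:
    """Plain polynomial reduction, used here only to precompute constants."""
    m = poly.bit_length() - 1
    while value.bit_length() > m:
        value ^= poly << (value.bit_length() - 1 - m)
    return value


def field_mul_via_montgomery(a: int, b: int, poly: int, m: int) -> int:
    """Ordinary field product obtained by entering and leaving the Montgomery domain."""
    r2 = poly_mod(1 << (2 * m), poly)        # x^(2m) mod poly
    a_bar = mont_mul(a, r2, poly, m)         # a * x^m
    b_bar = mont_mul(b, r2, poly, m)         # b * x^m
    c_bar = mont_mul(a_bar, b_bar, poly, m)  # a * b * x^m
    return mont_mul(c_bar, 1, poly, m)       # a * b


# Example field (assumption): GF(2^8) with x^8 + x^4 + x^3 + x + 1.
check = field_mul_via_montgomery(0x57, 0x83, 0x11B, 8)   # 0xC1
```

Each loop iteration uses only XORs and a one-bit shift, with no trial division, which is the property that makes the Montgomery form attractive for regular array structures. In a long computation such as an ECC scalar multiplication, operands would normally stay in the Montgomery domain so that the conversions are paid only once.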

To facilitate extraction of the Dependency Graph (DG) of the algorithm, the proposed multiplier is expressed in a bit-level form, which makes the dependencies between operations easier to visualize and analyze. The DG is then used to derive the proposed multiplier structure by selecting suitable time-scheduling and node-projection functions. A primary advantage of the proposed structure is its much lower area complexity compared with previously reported bi-dimensional arrangements. While many known structures have a spatial complexity of order

O (m^{2})

, the proposed structure has a spatial complexity of order

O (m)

. This reduction in complexity yields significant savings in physical area and energy consumption. Despite the reduced area complexity, the proposed multiplier has the same time delay as the bi-dimensional structures, so the savings in area and power do not come at the expense of performance. Furthermore, the modular form of the design and the local connectivity between its constituent Processing Elements (PEs) make it well suited to VLSI implementation: local links between the PEs simplify the overall structure and reduce wire delays.

The significant savings in space, delay, and power provided by the proposed multiplier architecture make it a highly promising solution for implementing computationally demanding cryptographic protocols in small RFID sensor tags tailored for applications that support individuals with disabilities. This innovation paves the way for enhancing the security and efficiency of RFID-based assistive systems, ultimately improving the quality of life and safety of disabled individuals.

2.2. Paper Organization

An outline of the paper’s structure is provided as follows: in Section 3, the chosen Montgomery multiplication algorithm is mathematically modeled, providing a clear understanding of its underlying principles and operations. Additionally, the bit-level representation of the algorithm is presented, which serves as a crucial foundation for further analysis and design considerations. Moving forward, Section 4 delves into the detailed description of the DG associated with the adopted algorithm. This section explores the intricate relationships and dependencies between various operations within the algorithm. Section 5 focuses on outlining the process of obtaining the suggested one-dimensional bit-parallel semi-systolic multiplier layout. This section explains the step-by-step methodology employed in designing the multiplier structure, taking into account the insights derived from the DG and specific requirements of the algorithm. To thoroughly evaluate the proposed multiplier’s effectiveness, Section 6 conducts a comprehensive analysis of its complexity. This includes an examination of the space complexity, comparing it to existing effective multipliers to highlight potential space savings. Moreover, the time complexity is also analyzed and compared with relevant multiplier designs to assess its efficiency. Furthermore, the paper includes a thorough evaluation of the performance of the suggested multiplier design, providing a realistic assessment of its capabilities. This evaluation includes a comparison with other multiplier designs that have been synthesized using ASIC technology. Finally, Section 7 provides a concise summary of the paper’s key findings and contributions.

3. Montgomery Multiplication in GF( $2^{m}$ )

Table 2 summarizes the notations used in the Montgomery multiplication algorithm over GF(

2^{m}

), along with their definitions. This serves as a reference to clarify the roles and significance of each term within the algorithm.

Assume that

H (λ)

is an irreducible polynomial of degree m, with coefficients

h_{m}

and

h_{0}

both equivalent to 1, and that it is employed to create the finite field GF

(2^{m})

. Considering that

σ

is a root of

H (λ)

, this implies that

H (σ)

equals 0. Polynomials of degree less than m over GF(2) are used to represent the elements of GF

(2^{m})

. Bitwise exclusive-OR (XOR) is an efficient way to add two polynomials in GF(

2^{m}

). In contrast, multiplying two polynomials in GF(

2^{m}

) is more involved, because the intermediate result requires a further modular reduction using

σ^{m} = \sum_{j = 1}^{m} h_{j - 1} σ^{j - 1}

.
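The two field operations described above can be sketched as follows. This is a minimal illustration, assuming elements are stored as Python integers whose bit j holds the coefficient of σ^j; the function names `gf_add` and `gf_mul` and the choice of m = 5 with H = λ^5 + λ^2 + 1 are assumptions made for the example, not taken from the paper.

```python
# Bit j of an integer is the coefficient of sigma^j.
M = 5
H = 0b100101  # lambda^5 + lambda^2 + 1 (illustrative irreducible polynomial, h_0 = h_m = 1)

def gf_add(a, b):
    """Addition in GF(2^m): bitwise XOR of the coefficient vectors."""
    return a ^ b

def gf_mul(a, b, h=H, m=M):
    """Shift-and-add multiplication with a reduction by H after every shift."""
    result = 0
    for j in range(m):
        if (b >> j) & 1:
            result ^= a          # add a * sigma^j to the running sum
        a <<= 1                  # multiply a by sigma
        if (a >> m) & 1:         # a sigma^m term appeared
            a ^= h               # replace sigma^m by the lower-order terms of H
    return result

print(bin(gf_add(0b101, 0b110)))      # 0b11
print(bin(gf_mul(0b10, 0b10000)))     # sigma * sigma^4 = sigma^2 + 1 -> 0b101
```

Addition needs no reduction, whereas every left shift in the multiplication may create a σ^m term that has to be folded back with H.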

Presume that two of the GF(

2^{m}

) elements to be multiplied are

χ

and

τ

. Assume that

π

is a fixed factor satisfying

g c d (π, H) = 1

and that C and D are the Montgomery residues of

χ

and

τ

, respectively. The Montgomery Modular Multiplication (MMM) of

C = χ π mod H = \sum_{j = 1}^{m} c_{j - 1} σ^{j - 1}

and

D = τ π mod H = \sum_{j = 1}^{m} d_{j - 1} σ^{j - 1}

is computed by

T = C D π^{- 1} mod H = χ τ π mod H = \sum_{j = 0}^{m - 1} t_{j} \cdot σ^{j}

. Employing T and 1 as inputs, Montgomery multiplication is then applied to yield the final output

ψ

, which is calculated as follows:

ψ = T π^{- 1} mod H = χ τ mod H

.
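The following sketch checks the two Montgomery steps above numerically on a small field. It is a hedged illustration only: the helper names (`gf_mul`, `gf_inv`, `mont_mul`, `to_mont`), the field size m = 5, the polynomial H = λ^5 + λ^2 + 1, and the factor π = σ^2 are all assumptions chosen for the example.

```python
M = 5
H = 0b100101  # illustrative irreducible polynomial lambda^5 + lambda^2 + 1

def gf_mul(a, b, h=H, m=M):
    """Reference GF(2^m) multiplication (shift, add, reduce)."""
    result = 0
    for j in range(m):
        if (b >> j) & 1:
            result ^= a
        a <<= 1
        if (a >> m) & 1:
            a ^= h
    return result

def gf_inv(a, m=M):
    """Inverse via a^(2^m - 2), valid for any nonzero field element."""
    result, e = 1, (1 << m) - 2
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result

PI = 0b00100            # pi = sigma^2 (assumed Montgomery factor for this example)
PI_INV = gf_inv(PI)

def to_mont(x):
    """Montgomery residue x * pi mod H."""
    return gf_mul(x, PI)

def mont_mul(c, d):
    """Montgomery product c * d * pi^-1 mod H."""
    return gf_mul(gf_mul(c, d), PI_INV)

chi, tau = 0b10011, 0b01110
C, D = to_mont(chi), to_mont(tau)
T = mont_mul(C, D)      # equals chi * tau * pi mod H
psi = mont_mul(T, 1)    # a second call with input 1 removes the remaining factor pi
```

Here `T` stays in the Montgomery domain, so chains of multiplications can reuse it directly and only the final result needs the extra call with 1.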

In numerous applications involving repeated multiplications, such as inversion, exponentiation, and elliptic curve point multiplication, Montgomery multiplication is beneficial because the cost of the pre- and post-transformations is amortized over many multiplications. Because it makes elliptic curve point multiplication more efficient, the field size m is usually chosen to be an odd integer in real-world applications. For instance, the five binary fields that the National Institute of Standards and Technology (NIST) recommends for use with ECC, (

m = 163, 233, 283, 409, 571

), have the characteristic that m is odd. In this situation, we are able to select

π = σ^{(m - 1) / 2}

and express the Montgomery multiplication

T = C D π^{- 1} mod H

as the sum of two separately computed polynomials

U = \sum_{j = 1}^{m} u_{j - 1} σ^{j - 1}

and

V = \sum_{j = 1}^{m} v_{j - 1} σ^{j - 1}

. The calculation of T can be given as [52,53]:

\begin{matrix} T & = (\sum_{j = 1}^{m} d_{j - 1} σ^{j - 1}) C σ^{- (m - 1) / 2} mod H \end{matrix}

\begin{matrix} = \sum_{j = (m + 1) / 2}^{m} d_{j - 1} C σ^{(2 j - m - 1) / 2} mod H \end{matrix}

(1a)

\begin{matrix} + \sum_{j = (m - 1) / 2}^{1} d_{j - 1} C σ^{(2 j - m - 1) / 2} mod H \end{matrix}

(1b)
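As a worked instance of Equations (1a) and (1b), consider the small field size m = 5 (chosen here only for illustration), so that π = σ^2. Substituting j = 3, 4, 5 into (1a) and j = 2, 1 into (1b) gives:

```latex
T = \underbrace{d_{2} C + d_{3} C σ + d_{4} C σ^{2}}_{\text{(1a)}}
  + \underbrace{d_{1} C σ^{-1} + d_{0} C σ^{-2}}_{\text{(1b)}} \bmod H
```

The first group involves only non-negative powers of σ and the second only negative powers, which is why the two parts can be accumulated independently.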

We first focus on Equation (1a). Let

C^{(i)} = C^{(i - 1)} σ mod H

, where

C^{(i)} = \sum_{j = 1}^{m} c_{j - 1}^{(i)} σ^{j - 1}

and

C^{(0)} = C

, denote the intermediate result at the

i^{th}

iteration for

i \in [1, (m + 1) / 2]

and

j \in [1, m]

. Using the expanded form, we are able to denote

C^{(i)}

as follows:

\begin{matrix} C^{(i)} & = (\sum_{j = 1}^{m} c_{j - 1}^{(i - 1)} σ^{j - 1}) σ mod H \\ = \sum_{j = 1}^{m} c_{j - 1}^{(i - 1)} σ^{j} mod H \\ = c_{m - 1}^{(i - 1)} \sum_{j = 1}^{m} h_{j - 1} σ^{j - 1} + \sum_{j = 1}^{m - 1} c_{j - 1}^{(i - 1)} σ^{j} \end{matrix}

(2)

Equation (2) is obtained by shifting the coefficients of

C^{(i - 1)}

one position to the left and then using H to reduce the term

c_{m - 1}^{(i - 1)} σ^{m}

. As a result, we are left with the following

C^{(i)}

coefficients:

\begin{matrix} c_{j}^{(i)} = & c_{j - 1}^{(i - 1)} + c_{m - 1}^{(i - 1)} h_{j} \end{matrix}

(3)

with

c_{j}^{0} = c_{j}

,

0 \leq j \leq m - 1

, and

c_{m - 1}^{(i - 1)}

can be allocated to

c_{0}^{i}

for

1 \leq i \leq (m - 1) / 2

. We can now formulate Equation (1a) as:

\begin{matrix} U^{(i)} = U^{(i - 1)} + d_{(m + 2 i - 3) / 2} C^{(i - 1)} \end{matrix}

(4)

with

U^{(i)} = \sum_{j = 1}^{m} u_{j - 1}^{(i)} σ^{j - 1}

defining the ith interim outcome. It is possible to rephrase

U^{(i)}

as follows:

\begin{matrix} U^{(i)} & = (\sum_{j = 1}^{m} u_{j - 1}^{(i)} σ^{j - 1}) \\ = \sum_{j = 1}^{i} d_{(m + 2 j - 3) / 2} C σ^{j - 1} \\ = \sum_{j = 1}^{m} u_{j - 1}^{(i - 1)} σ^{j - 1} + d_{(m + 2 i - 3) / 2} C σ^{i - 1} \\ = \sum_{j = 1}^{m} u_{j - 1}^{(i - 1)} σ^{j - 1} + d_{(m + 2 i - 3) / 2} \sum_{j = 1}^{m} c_{j - 1}^{(i - 1)} σ^{j - 1} \end{matrix}

(5)

with

U^{(0)} = 0

. Equation (5) can simply be produced by multiplication and accumulation (MAC) operations. The values of the coefficients of

U^{(i)}

can be found using Equation (5) in the following manner:

\begin{matrix} u_{j - 1}^{(i)} = & u_{j - 1}^{(i - 1)} + d_{(m + 2 i - 3) / 2} c_{j - 1}^{(i - 1)} \end{matrix}

(6)

with

u_{j}^{0} = 0

and

c_{j}^{0} = c_{j}

for

0 \leq j \leq m - 1

. Additionally,

c_{0}^{i} = c_{m - 1}^{(i - 1)}

for

1 \leq i \leq (m - 1) / 2

. The outcome of

U^{(m + 1) / 2}

becomes available after

(m + 1) / 2

rounds. It should be noted that Equations (3) and (6), which share the same term

c_{j - 1}^{(i - 1)}

, are additionally independent of one another and can therefore be evaluated simultaneously.
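The recurrences in Equations (3) and (6) can be sketched at the bit level as follows. This is a minimal sketch under stated assumptions: coefficients are held in Python lists indexed by j, the function name `upper_half` is hypothetical, and the example polynomial H = λ^5 + λ^2 + 1 is chosen only for illustration.

```python
def upper_half(c_bits, d_bits, h_bits):
    """Accumulate U over (m+1)/2 rounds.

    c_bits[j], d_bits[j] hold c_j, d_j for j = 0..m-1; h_bits[j] holds h_j for j = 0..m.
    """
    m = len(c_bits)
    c = list(c_bits)              # C^(0) = C
    u = [0] * m                   # U^(0) = 0
    for i in range(1, (m + 1) // 2 + 1):
        d = d_bits[(m + 2 * i - 3) // 2]
        # Equation (6): u_j^(i) = u_j^(i-1) + d * c_j^(i-1)
        u = [u[j] ^ (d & c[j]) for j in range(m)]
        # Equation (3): shift left by one and fold c_{m-1}^(i-1) back in through H
        top = c[m - 1]
        c = [top] + [c[j - 1] ^ (top & h_bits[j]) for j in range(1, m)]
    return u

H_BITS = [1, 0, 1, 0, 0, 1]       # h_0..h_5 of lambda^5 + lambda^2 + 1 (illustrative)
# C = sigma^4 and D = sigma^3, so U = C * sigma = sigma^5 = sigma^2 + 1
print(upper_half([0, 0, 0, 0, 1], [0, 0, 0, 1, 0], H_BITS))   # [1, 0, 1, 0, 0]
```

Both list comprehensions read only the values from round i - 1, which mirrors the observation that the two updates can be evaluated simultaneously.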

Next, we examine Equation (1b). Let

C^{(i)} = C^{(i - 1)} σ^{- 1} mod H

, with

C^{(i)} = \sum_{j = 1}^{m} c_{j - 1}^{(i)} σ^{j - 1}

, denote the interim outcome at the ith iteration, with

C^{(0)} = C

, for

i \in [1, (m + 1) / 2]

and

j \in [1, m]

.

Any irreducible polynomial should have the properties that

h_{0} = h_{m} = 1

, and

σ

is a root of H. We can generate

σ^{- 1} = \sum_{j = 1}^{m} h_{j} σ^{j - 1}

by multiplying both sides of

H = \sum_{j = 1}^{m + 1} h_{j - 1} σ^{j - 1}

= 0 by

σ^{- 1}

and rearranging the terms. By applying

σ^{- 1}

,

C^{(i)}

could be constructed in the following way:

\begin{matrix} C^{(i)} & = (\sum_{j = 1}^{m} c_{j - 1}^{(i - 1)} σ^{j - 1}) σ^{- 1} mod H \\ = \sum_{j = 1}^{m} c_{j - 1}^{(i - 1)} σ^{j - 2} mod H \\ = c_{0}^{(i - 1)} \sum_{j = 1}^{m} h_{j} σ^{j - 1} + \sum_{j = 2}^{m} c_{j - 1}^{(i - 1)} σ^{j - 2} \end{matrix}

(7)

Equation (7) is obtained by shifting the coefficients of

C^{(i - 1)}

one position to the right and then using H to reduce the term

c_{0}^{(i - 1)} σ^{- 1}

. As a result, we are left with the following

C^{(i)}

coefficients:

\begin{matrix} c_{m - j - 1}^{(i)} = & c_{m - j}^{(i - 1)} + c_{0}^{(i - 1)} h_{m - j} \end{matrix}

(8)

with

c_{j}^{0} = c_{j}

,

0 \leq j \leq m - 1

, and

c_{0}^{(i - 1)}

can be allocated to

c_{m - 1}^{i}

for

1 \leq i \leq (m - 1) / 2

. We can now formulate Equation (1b) as:

\begin{matrix} V^{(i)} = V^{(i - 1)} + d_{(m - 2 i + 1) / 2} C^{(i - 1)} \end{matrix}

(9)

with

V^{(i)} = \sum_{j = 1}^{m} v_{j - 1}^{(i)} σ^{j - 1}

defining the ith interim outcome. It is possible to rephrase

V^{(i)}

as follows:

\begin{matrix} V^{(i)} & = (\sum_{j = 1}^{m} v_{j - 1}^{(i)} σ^{j - 1}) \\ = \sum_{j = 1}^{i} d_{(m - 2 j + 1) / 2} C σ^{- (j - 1)} \\ = \sum_{j = 1}^{m} v_{j - 1}^{(i - 1)} σ^{j - 1} + d_{(m - 2 i + 1) / 2} C σ^{- (i - 1)} \\ = \sum_{j = 1}^{m} v_{j - 1}^{(i - 1)} σ^{j - 1} + d_{(m - 2 i + 1) / 2} \sum_{j = 1}^{m} c_{j - 1}^{(i - 1)} σ^{j - 1} \end{matrix}

(10)

with

V^{(0)} = 0

. Equation (10) can simply be produced by multiplication and accumulation (MAC) operations. The values of the coefficients of

V^{(i)}

can be found using Equation (10) in the following manner:

\begin{matrix} v_{m - j}^{(i)} = & v_{m - j}^{(i - 1)} + d_{(m - 2 i + 1) / 2} c_{m - j}^{(i - 1)} \end{matrix}

(11)

with

v_{j}^{0} = 0

,

d_{(m - 1) / 2} = 0

, and

c_{j}^{0} = c_{j}

for

0 \leq j \leq m - 1

,

c_{m - 1}^{(i)} = c_{0}^{(i - 1)}

for

1 \leq i \leq (m - 1) / 2

. The outcome of

V^{(m + 1) / 2}

becomes available after

(m + 1) / 2

rounds. It should be noted that Equations (8) and (11), which share the same term

c_{m - j}^{(i - 1)}

, are additionally independent of one another and can therefore be evaluated simultaneously.
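The recurrences in Equations (8) and (11) can be sketched in the same bit-level style. This is a hedged sketch: the function name `lower_half` and the example polynomial are assumptions, and forcing the first-round multiplier bit to zero is one reading of the stated initial condition that d_{(m-1)/2} = 0 for this part.

```python
def lower_half(c_bits, d_bits, h_bits):
    """Accumulate V over (m+1)/2 rounds.

    c_bits[j], d_bits[j] hold c_j, d_j for j = 0..m-1; h_bits[j] holds h_j for j = 0..m.
    """
    m = len(c_bits)
    c = list(c_bits)              # C^(0) = C
    v = [0] * m                   # V^(0) = 0
    for i in range(1, (m + 1) // 2 + 1):
        # d_{(m-1)/2} is treated as 0 in this part (assumed from the initial conditions)
        d = 0 if i == 1 else d_bits[(m - 2 * i + 1) // 2]
        # Equation (11): v_k^(i) = v_k^(i-1) + d * c_k^(i-1)
        v = [v[k] ^ (d & c[k]) for k in range(m)]
        # Equation (8): shift right by one and fold c_0^(i-1) back in through H
        low = c[0]
        c = [c[k + 1] ^ (low & h_bits[k + 1]) for k in range(m - 1)] + [low]
    return v

H_BITS = [1, 0, 1, 0, 0, 1]       # h_0..h_5 of lambda^5 + lambda^2 + 1 (illustrative)
# C = 1 and D = sigma, so V = C * sigma^-1 = sigma^4 + sigma
print(lower_half([1, 0, 0, 0, 0], [0, 1, 0, 0, 0], H_BITS))   # [0, 1, 0, 0, 1]
```

The printed value matches σ^{-1} = h_1 + h_2 σ + ... + h_m σ^{m-1} for this H, which is the identity derived above.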

Ultimately,

U^{(m + 1) / 2}

and

V^{(m + 1) / 2}

have to be combined by m 2-input XOR gates to produce the MMM result T as an output.

Algorithms 1 and 2 summarize the equations derived above. Algorithm 2 is the bit-level version of Algorithm 1.

Algorithm 1 Efficient Montgomery Multiplication Algorithm over Binary Finite Fields

Input: C, D, π^{−1} = σ^{−(m−1)/2}, and H

Output: T

Initialization:

U⁰ ← 0, V⁰ ← 0, C⁰ ← C

Algorithm:

1: for $1 \leq i \leq (m + 1) / 2$ do
2:   $C^{i} = C^{i - 1} σ mod H$
3:   $U^{i} = U^{i - 1} + d_{(m + 2 i - 3) / 2} C^{i - 1}$
4: end for
5: for $1 \leq i \leq (m + 1) / 2$ do
6:   $C^{i} = C^{i - 1} σ^{- 1} mod H$
7:   $V^{i} = V^{i - 1} + d_{(m - 2 i + 1) / 2} C^{i - 1}$
8: end for
9: $T = U^{(m + 1) / 2} + V^{(m + 1) / 2}$

Algorithm 2 Bit-Level Implementation of the Montgomery Multiplication Algorithm

Input: C = (c_m−1c_m−2 ⋯ c₀), D = (d_m−1d_m−2 ⋯ d₀), H = (h_mh_m−1 ⋯ h₀)

Output: T = (t_m−1t_m−2 ⋯ t₀)

Initialization:

U^{0} = (u_{m - 1}^{0} u_{m - 2}^{0} \dots u_{0}^{0}) \leftarrow (00 \dots 0)

V^{0} = (v_{m - 1}^{0} v_{m - 2}^{0} \dots v_{0}^{0}) \leftarrow (00 \dots 0)

C^{0} = (c_{m - 1}^{0} c_{m - 2}^{0} \dots c_{0}^{0}) \leftarrow (c_{m - 1} c_{m - 2} \dots c_{0})

Algorithm:

1: for $1 \leq i \leq (m + 1) / 2$ do
2:   $c_{0}^{i} = c_{m - 1}^{i - 1}$
3:   for $1 \leq j \leq m$ do
4:     $c_{j}^{i} = c_{j - 1}^{i - 1} + c_{m - 1}^{i - 1} h_{j}$
5:     $u_{j - 1}^{i} = u_{j - 1}^{i - 1} + d_{(m + 2 i - 3) / 2} c_{j - 1}^{i - 1}$
6:   end for
7: end for
8: for $1 \leq i \leq (m + 1) / 2$ do
9:   $c_{m - 1}^{i} = c_{0}^{i - 1}$
10:   for $1 \leq j \leq m$ do
11:     $c_{m - j - 1}^{i} = c_{m - j}^{i - 1} + c_{0}^{i - 1} h_{m - j}$
12:     $v_{m - j}^{i} = v_{m - j}^{i - 1} + d_{(m - 2 i + 1) / 2} c_{m - j}^{i - 1}$
13:   end for
14: end for
15: for $1 \leq j \leq m$ do
16:   $t_{j - 1} = u_{j - 1}^{(m + 1) / 2} + v_{j - 1}^{(m + 1) / 2}$
17: end for
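A software rendering of Algorithm 2 may help when checking the index arithmetic. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function name `montgomery_bit_level` is hypothetical, C is reloaded with its initial value before the second loop, the bit d_{(m-1)/2} is treated as zero in the first round of the second loop (following the initial conditions given with Equation (11)), and the unused coefficients c_m and c_{-1} are simply not computed.

```python
def montgomery_bit_level(c_bits, d_bits, h_bits):
    """Return the coefficients t_0..t_{m-1} of T = C * D * sigma^(-(m-1)/2) mod H."""
    m = len(c_bits)
    rounds = (m + 1) // 2

    # Steps 1-7: upper part U
    c, u = list(c_bits), [0] * m
    for i in range(1, rounds + 1):
        d = d_bits[(m + 2 * i - 3) // 2]
        nxt = [0] * m
        nxt[0] = c[m - 1]                                      # step 2
        for j in range(1, m + 1):
            if j < m:
                nxt[j] = c[j - 1] ^ (c[m - 1] & h_bits[j])     # step 4
            u[j - 1] ^= d & c[j - 1]                           # step 5
        c = nxt

    # Steps 8-14: lower part V, restarting from C^0 = C (assumed)
    c, v = list(c_bits), [0] * m
    for i in range(1, rounds + 1):
        d = 0 if i == 1 else d_bits[(m - 2 * i + 1) // 2]      # d_{(m-1)/2} := 0 here
        nxt = [0] * m
        nxt[m - 1] = c[0]                                      # step 9
        for j in range(1, m + 1):
            if j < m:
                nxt[m - j - 1] = c[m - j] ^ (c[0] & h_bits[m - j])   # step 11
            v[m - j] ^= d & c[m - j]                           # step 12
        c = nxt

    # Steps 15-17: m two-input XORs
    return [u[j] ^ v[j] for j in range(m)]

H_BITS = [1, 0, 1, 0, 0, 1]     # h_0..h_5 of lambda^5 + lambda^2 + 1 (illustrative)
# C = 1, D = sigma^2, pi = sigma^2, so T = 1
print(montgomery_bit_level([1, 0, 0, 0, 0], [0, 0, 1, 0, 0], H_BITS))   # [1, 0, 0, 0, 0]
```

The two outer loops do not share state, so in hardware they can run side by side; the sequential order here is only a convenience of the software model.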

4. Dependency Graph

The iterative portion of the Montgomery multiplication algorithm involves recursive Equations (3), (6), (8) and (11) that describe the computation steps. These equations have a similar computation structure but differ in terms of their coordinate directions. To better understand the computational dependencies and patterns in the algorithm, we can represent them using DGs. These DGs provide a visual representation of the computations involved in the iterative portion of the algorithm. In this case, we have derived DGs for a specific field size of

m = 5

, which are displayed in Figure 2 and Figure 3. These graphs are defined in a two-dimensional integer domain

D

, with indices i and j indicating the nodes in the graph.

By analyzing the DGs, we can gain insights into the parallelism and coordination of the computations in the algorithm. Each node in the DG represents a computation step, and the edges between nodes indicate the dependencies between these steps. The iterative formulas of the algorithm are computed by these nodes, which can be organized and scheduled to exploit parallelism and optimize performance.

The DGs provide valuable information for designing efficient hardware architectures for the Montgomery multiplication algorithm. By understanding the dependencies and patterns in the iterative computations, it is possible to develop specialized structures that maximize parallelism, minimize resource usage, and optimize performance. These architectures can be particularly beneficial in the context of VLSI implementation, where efficient hardware utilization is crucial.

In Figure 2, the data entry points within the DG are organized along designated pathways to facilitate an orderly processing sequence. The following section outlines the various input pathways along with their associated signals:

Left-to-Right pathway: This pathway serves to introduce the input signals $d_{(m + 2 i - 3) / 2}$ , with $1 \leq i \leq (m + 1) / 2$ . These signals are fed into the DG from the left edge, ensuring a smooth integration into the processing sequence. The notation $(m + 2 i - 3) / 2$ indicates the specific indexing of the input signals based on the field size m.
Top Entry Point: The top portion of the DG serves as the entry point for the input signals $h_{j}$ and the initial zero values of signals $u_{j - 1}$ , where $1 \leq j \leq m$ .
Red Slanted Lines: The red slanted lines at the right side of the entry points are used to insert the input signals $c_{j - 1}$ , where $1 \leq j \leq m$ .

Inside the DG, each node computes intermediate values of the coefficients of the variable U and passes them to the nodes in the next row along the paths indicated by the arrows in the graph. This continues until the final row of the DG is reached, and the resulting coefficients of U are produced at the bottom of the DG. The directional flow ensures that the computations are performed in the required order and that the dependencies between the nodes are maintained.

In a similar fashion, the data entering the DG depicted in Figure 3 are strategically organized along specific pathways to enable the necessary calculations to occur. Let us take a closer look at these input paths and the various types of signals they convey:

Right-to-Left pathway: The pathway moving from right to left serves to introduce the input signal $d_{(m - 2 i + 1) / 2}$ , where i is constrained to be between 1 and $(m + 1) / 2$ . These data points are fed into the DG from the lateral right edge.
Bottom Entry Point: The lower section of the DG serves as the designated area for inserting the input signal $h_{m - j}$ alongside the starting values of zero for the signal $v_{m - j}$ . This insertion takes place for indices satisfying the condition $1 \leq j \leq m$ .
Red Slanted Lines: The red slanted lines depicted at the left corners of the input nodes are used to insert the input signals $c_{m - j}$ , where $1 \leq j \leq m$ .

Within the DG, each node computes the running partial sums of the coefficients of V and passes them to the nodes in the next row, following the paths indicated by the arrows in the diagram. This continues until the uppermost row of the DG is reached, where the resulting coefficients of the variable V are produced as the final output of this part of the computation.

The final product T is obtained by summing the resulting coefficients of

U^{(m + 1) / 2}

and

V^{(m + 1) / 2}

. To implement this summation, 2-input XOR gates can be utilized. Each corresponding bit of the coefficients is connected to an input of the XOR gate. The outputs of the XOR gates are then combined to obtain the final result T.

5. Development of the Semi-Systolic Multiplier Architecture

In this section, we apply the approach previously outlined by the authors to develop a one-dimensional parallel semi-systolic multiplier structure [55,56,57]. This method applies scheduling and node projection techniques to the Dependency Graphs (DGs) to obtain a high-performance parallel multiplier configuration from the selected Montgomery algorithm. We apply the approach to the first DG, illustrated in Figure 2. The second DG, depicted in Figure 3, has a similar configuration but differs in the orientation of its coordinates. Despite these directional differences, both DGs yield the same semi-systolic configuration, with signals flowing in opposite directions.

5.1. Scheduling Function

In analyzing the structure of the DG shown in Figure 2, we consider the individual points (nodes) that are represented as coordinates. Each of these points is denoted as

p (i, j) = [i j]

. To determine the execution sequence for each point, we utilize a scheduling vector

s = [s_{0} s_{1}]

, along with a corresponding scheduling function. The scheduling function, represented as

L (p)

, can be defined as:

L (p) = s p - k = i s_{0} + j s_{1} - k

(12)

where

p

represents the position vector

[i j]

, which indicates the specific location within the DG, while k is a scalar offset chosen so that only positive time values are assigned to the DG nodes. In this case, selecting

k = 0

guarantees that all nodes in the DG shown in Figure 2 are allocated positive time values.

The scheduling vector

s = [s_{0} s_{1}]

must adhere to specific constraints. One important constraint stipulates that nodes positioned at

p = [i, j]

should only begin execution after the completion of the nodes located at

p = [i - 1, j]

. This requirement ensures proper sequencing of operations and can be expressed as the inequality:

L (p = [i, j]) > L (p = [i - 1, j])

(13)

When considering the coordinate values of the scheduling vector

s

, this inequality can be formulated as:

\begin{matrix} i s_{0} + j s_{1} & > & (i - 1) s_{0} + j s_{1} \\ s_{0} & > & 0 \end{matrix}

(14)

The previously outlined condition on the timing vector

s

stipulates that the time value designated for a node at

p = [i, j]

must exceed the time value designated for the node at

p = [i - 1, j]

. This requirement is crucial for maintaining the correct order of execution, as it prevents any potential conflicts that could arise from overlapping operations. By following this constraint, the desired operational sequence of nodes within the DG can be effectively realized, ultimately leading to a more efficient processing of tasks and better resource management.

Based on the given iterations in Equations (3), (6), (8), and (11), we have an additional timing restriction. In particular, it is imperative that tasks assigned to points

p = [i, j + 1]

are executed only after the completion of tasks associated with points

p = [i - 1, j]

. This sequential dependency is critical for maintaining the integrity of the operations, as it ensures that all prerequisite tasks are completed before proceeding to subsequent ones. This relationship can be formally expressed as:

L (p = [i, j + 1]) > L (p = [i - 1, j])

(15)

Expanding the expressions using the coordinate values of

s

, we have:

\begin{matrix} i s_{0} + j s_{1} + s_{1} & > & i s_{0} - s_{0} + j s_{1} \\ s_{1} & > & - s_{0} \end{matrix}

(16)

This relational expression delineates the temporal constraint imposed on the scheduling vector

s

. It ensures that the time index of a node situated at

p = [i, j + 1]

must exceed the time indicator of the node positioned at

p = [i - 1, j]

. By enforcing this limitation, the intended sequence of operational execution within the DG is secured, thereby facilitating a more efficient and coherent workflow in task processing.

By examining the relational Expressions (14) and (16), we can identify appropriate scheduling vectors that satisfy the established constraints. These expressions provide a framework for analyzing the dependencies between tasks, enabling us to derive vectors that ensure the correct sequence of execution. A viable option for a compliant scheduling vector is:

\begin{matrix} s & = & [\begin{matrix} 1 & 0 \end{matrix}] \end{matrix}

(17)

The timing arrangement of the nodes, following the implementation of this scheduling vector on the DGs, is visualized in Figure 4 and Figure 5. A review of these diagrams reveals that the incoming signals

c_{j - 1}

,

c_{m - j}

,

h_{j}

, and

h_{m - j}

are processed concurrently. Following

(m + 1) / 2

clock cycles, the produced signals

u_{j - 1}

and

v_{m - j}

(for values of j ranging from 1 to m) are produced simultaneously. This observation highlights the effectiveness of the scheduling vector in optimizing task execution within the DGs.
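The constraints in Expressions (14) and (16) can be checked mechanically for a candidate scheduling vector. The sketch below does this for the small case m = 5; the helper names (`dependency_edges`, `schedule`, `is_valid_schedule`) are hypothetical, and the edge list encodes only the two dependencies named in the text.

```python
def dependency_edges(m):
    """Edges (src, dst) for rows i = 1..(m+1)/2 and columns j = 1..m."""
    rows = (m + 1) // 2
    edges = []
    for i in range(2, rows + 1):
        for j in range(1, m + 1):
            edges.append(((i - 1, j), (i, j)))            # dependency behind (13)
            if j < m:
                edges.append(((i - 1, j), (i, j + 1)))    # dependency behind (15)
    return edges

def schedule(p, s, k=0):
    """Scheduling function L(p) = s . p - k from Equation (12)."""
    return p[0] * s[0] + p[1] * s[1] - k

def is_valid_schedule(s, m=5, k=0):
    """True if every edge goes forward in time and every node gets a positive time."""
    rows = (m + 1) // 2
    nodes = [(i, j) for i in range(1, rows + 1) for j in range(1, m + 1)]
    causal = all(schedule(dst, s, k) > schedule(src, s, k)
                 for src, dst in dependency_edges(m))
    positive = all(schedule(p, s, k) > 0 for p in nodes)
    return causal and positive

print(is_valid_schedule((1, 0)))    # True: the vector chosen in Equation (17)
print(is_valid_schedule((0, 1)))    # False: violates s_0 > 0
```

With s = [1 0] every node in row i is assigned time i, so a whole row executes in one step and the (m + 1)/2 rows finish in (m + 1)/2 steps.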

5.2. Projection Function

Following earlier work by the second author, the projection function maps multiple DG nodes or points, represented by

p (i, j)

, into one unified processing element

\bar{p}

[55]. This transformation facilitates efficient data handling and processing within the system. Through the interconnection of these processing elements, a systolic or semi-systolic arrangement is formed, optimizing performance and resource utilization. The projection function can be represented as:

\bar{p} = G p

(18)

In this formulation,

G

denotes the projection matrix. To establish this projection matrix, it is necessary to determine the null space associated with it, referred to as

E

. As highlighted in [55], it is vital to apply a particular constraint on the projection vector

E

to guarantee both robustness and accuracy in the resulting projections. This constraint is instrumental in maintaining the fidelity of the data being processed, thereby enhancing the effective use of the projection matrix in a range of applications.

sE \neq 0

(19)

This constraint ensures that each processing element carries out its specific tasks at varied times, which allows for enhanced efficiency in the utilization of the processing elements through multiplexing. By distributing the workload across different time intervals, the system can minimize idle time and maximize throughput.

By considering the constraint outlined in Equation (19), along with the specified scheduling vector

s = [1 0]

, we can better understand the implications for system design. Furthermore, given the requirement for a bit-parallel semi-systolic architecture, the projection vector that fulfills these specific conditions can be articulated as follows:

\begin{matrix} E & = & [\begin{matrix} 1 & 0 \end{matrix}] \end{matrix}

(20)

This particular projection vector guarantees that both the scheduling vector

s

and the projection vector

E

conform to the constraint specified in Equation (19). Consequently, this alignment not only facilitates the development of a bit-parallel semi-systolic structure but also ensures that it exhibits the required characteristics for optimal performance in practical applications, ultimately contributing to improved efficiency and reliability in system operations.

The projection matrix

G

can be derived based on the fact that

E

is the null space of

G

. It can be expressed as:

\begin{matrix} G & = & [\begin{matrix} 0 & 1 \end{matrix}] \end{matrix}

(21)

This matrix defines the mapping from the set of DG nodes or points to the processing elements: all nodes that differ only along the projection direction are assigned to the same processing element.
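The conditions on the projection can be verified with a few lines of arithmetic. This is a small sketch using plain tuples; the helper names `dot` and `project` are assumptions introduced for the example.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

S = (1, 0)   # scheduling vector, Equation (17)
E = (1, 0)   # projection vector, Equation (20)
G = (0, 1)   # projection matrix (a single row here), Equation (21)

def project(p, g=G):
    """Map a DG node p = (i, j) to its processing element index, Equation (18)."""
    return dot(g, p)

constraint_ok = dot(S, E) != 0   # Equation (19): s E != 0
null_space_ok = dot(G, E) == 0   # E lies in the null space of G

print(constraint_ok, null_space_ok)          # True True
print(project((1, 3)), project((2, 3)))      # 3 3: same column, same PE
```

Nodes (1, 3) and (2, 3) differ by E and therefore land on the same processing element, but the schedule assigns them different times, which is what the constraint in Equation (19) guarantees.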

5.3. Exploring the Design of Semi-Systolic Multiplier Layout

The functions

L (p)

and

\bar{p} (p)

corresponding to each node

p [i, j]

within the DG presented in Figure 2 can be formulated by incorporating the vectors

s = [1 0]

and

G = [0 1]

into the Expressions (12) and (18). This methodological integration facilitates a precise definition of the resulting functions, which are integral to the analysis of the graph’s structural properties and operational dynamics. The resultant functions can be delineated as follows:

\begin{matrix} L (p) = i \\ \bar{p} (p) = j \end{matrix}

(22)

Applying a similar procedure to each DG node

p [i, m - j]

in Figure 3, we can determine the functions

L (p)

and

\bar{p} (p)

as follows:

\begin{matrix} L (p) = i \\ \bar{p} (p) = m - j \end{matrix}

(23)
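Equations (22) and (23) can be tabulated to see how the DG nodes are distributed over time and processing elements. The sketch below does this for m = 5; the function names and the dictionary layout are illustrative assumptions.

```python
from collections import defaultdict

M = 5
ROWS = (M + 1) // 2

def upper_map(i, j):
    """Equation (22): node (i, j) of the first DG runs at time i on PE j."""
    return i, j

def lower_map(i, j):
    """Equation (23): node (i, m - j) of the second DG runs at time i on PE m - j."""
    return i, M - j

def space_time(mapping):
    """Collect, for each PE index, the list of time steps at which it is active."""
    table = defaultdict(list)
    for i in range(1, ROWS + 1):
        for j in range(1, M + 1):
            t, pe = mapping(i, j)
            table[pe].append(t)
    return dict(table)

upper = space_time(upper_map)
lower = space_time(lower_map)
print(upper[1])          # [1, 2, 3]
print(sorted(lower))     # [0, 1, 2, 3, 4]
```

Each of the m processing elements in either array is active exactly once per time step, so each two-dimensional DG collapses onto a linear array of m elements without scheduling conflicts.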

By applying the functions derived in Equations (22) and (23) to the DG nodes of Figure 2 and Figure 3, respectively, we can construct a bit-parallel semi-systolic multiplier structure. The structure is depicted in Figure 6 and consists of two one-dimensional semi-systolic arrays. Each array is composed of m Processing Elements (PEs) arranged in a linear fashion.

The upper semi-systolic array is responsible for computing the coefficients of the polynomial U. The internal logic of the intermediate PE is illustrated in Figure 7. This PE performs the necessary computations to generate the coefficients of the polynomial U based on the inputs it receives. It is worth noting that the first and last PEs shown in Figure 8 and Figure 9, respectively, are simplified versions of the intermediate PE. These simplified PEs may have reduced functionality compared to the intermediate PE, as they are located at the boundaries of the upper semi-systolic array.

The lower semi-systolic array, depicted in Figure 6, is responsible for computing the coefficients of the polynomial V. Each PE within the lower semi-systolic array performs specific computations to generate the coefficients of V based on the inputs it receives. The internal logic of the intermediate PE in the lower semi-systolic array is illustrated in Figure 10. This PE incorporates the necessary operations and calculations to compute the coefficients of the polynomial V. It is worth mentioning that the first and last PEs shown in Figure 11 and Figure 12, respectively, are simplified versions of the intermediate PE. These simplified PEs are designed to accommodate the specific requirements and constraints at the boundaries of the lower semi-systolic array. Their simplified design may involve reduced functionality compared to the intermediate PE.

In Figure 6, it can be observed that the inputs

u_{j - 1}^{0}

and

v_{m - j}^{0}

are both equal to zero. This observation allows for optimizations in terms of area and delay complexities by resetting the

D_{u}

and

D_{v}

flip-flops (shown in Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12) at the appropriate time. By resetting these flip-flops, the required zero values can be directly presented at the input of the XOR gate, eliminating the need for additional logic to compute those values. Furthermore, in the last PE,

{PE}_{m}

, there is no need to compute the updated values of

c_{j}^{i}

and

c_{m - j - 1}^{i}

. Therefore, the logic structure responsible for computing these variables can be removed from

{PE}_{m}

, resulting in a simplified design as shown in Figure 9 and Figure 12, respectively. This modification significantly reduces the area overhead of the semi-systolic array.

In contrast to the regular PEs, the first PE,

{PE}_{1}

, utilizes the signals

c_{m - 1}^{i - 1}

and

c_{0}^{i - 1}

as inputs to the multiplexer

M_{c}

instead of

c_{j - 1}^{i - 1}

and

c_{m - j}^{i - 1}

. This configuration allows for assigning the signals

c_{m - 1}^{i - 1}

and

c_{0}^{i - 1}

to the corresponding signals

c_{0}^{i}

and

c_{m - 1}^{i}

, effectively implementing the rotate right operation required in the computation process. Overall, these modifications result in a more optimized and streamlined semi-systolic array, reducing the spatial footprint and improving operational effectiveness by eliminating extraneous computations and refining the logical framework.

The proposed semi-systolic multiplier improves on previously reported two-dimensional parallel systolic architectures in terms of spatial complexity. While conventional two-dimensional parallel systolic architectures typically exhibit a spatial complexity of $O (m^{2})$, the proposed semi-systolic multiplier achieves a spatial complexity of $O (m)$, which uses hardware resources more efficiently. Specifically, it has a smaller spatial footprint than the two-dimensional parallel semi-systolic Montgomery multipliers detailed in [52,53]. It also offers significant footprint advantages over parallel multipliers based on established field multiplication techniques, such as those in [23,24,48,49,51,58,59]. The results section quantifies these savings in both area and power.

The layout of the developed parallel semi-systolic multiplier, based on Figures 6–12, can be described as follows:

Input Signals: The input signals $c_{j - 1}$, $c_{m - j}$, $h_{j}$, and $h_{m - j}$, where $1 \leq j \leq m$, are assigned to each PE within the layout.
Zero Initialization: The initial values of the input signals $u_{j - 1}$ and $v_{m - j}$, where $1 \leq j \leq m$, are set to zero. As a result, the inputs near the left corners of the upper PEs and the inputs near the right corners of the lower PEs are assigned zero values.
Sequential Input Signals: The input signals $d_{(m + 2 i - 3) / 2}$ and $d_{(m - 2 i + 1) / 2}$, where $1 \leq i \leq (m + 1) / 2$, are fed sequentially and pass through all the PEs in a regular sequence.
Intermediate Signal Generation: Each PE generates intermediate signals $c_{m - j - 1}^{i}$ and $c_{j}^{i}$, where $1 \leq i \leq (m + 1) / 2$ and $1 \leq j \leq m$, from the inputs it receives. The intermediate signals are pipelined through $D_{c}$ latches (solid red box in Figure 7, Figure 9, Figure 10 and Figure 12) and passed to the next PE.
Parallel Output: The resulting bits of $u_{j - 1}^{(m + 1) / 2}$ and $v_{m - j}^{(m + 1) / 2}$, where $1 \leq j \leq m$, are available simultaneously at the outputs of all PEs after $(m + 1) / 2$ clock cycles.
Final Product Calculation: The final product bits $t_{j - 1}$, where $1 \leq j \leq m$, are obtained at clock cycle $(m + 1) / 2$ by adding (using 2-input XOR gates) the bits of $u_{j - 1}^{(m + 1) / 2}$ and $v_{j - 1}^{(m + 1) / 2}$.

The operation of the proposed bit-parallel semi-systolic multiplier proceeds in the following sequence:

Initialization: In the first clock period, the latches $D_{u}$ and $D_{v}$ are reset, causing the input bits $u_{j - 1}$ and $v_{m - j}$ (where $1 \leq j \leq m$) to be set to zero. Simultaneously, the control signal of MUX $M_{c}$ is deactivated, allowing the input signals $c_{j - 1}$ and $c_{m - j}$ (where $1 \leq j \leq m$) to be passed to the corresponding PEs within the layout.
Computation: From the second clock period until clock period $(m + 1) / 2$, the control signal of MUX $M_{c}$ is activated. This enables the intermediate signals $c_{j - 1}^{i - 1}$ and $c_{m - j}^{i - 1}$ (where $1 \leq j \leq m$) to be passed through the PEs for the computation of the intermediate values $c_{j}^{i}$, $c_{m - j - 1}^{i}$, $u_{j - 1}^{i}$, and $v_{m - j}^{i}$ (where $1 \leq j \leq m$). Additionally, during these clock cycles, the input signals $d_{(m + 2 i - 3) / 2}$ and $d_{(m - 2 i + 1) / 2}$ are sequentially fed into the system.
Parallel Output: At clock period $(m + 1) / 2$, the parallel output bits of the product T, denoted as $t_{j - 1}$ (where $1 \leq j \leq m$), are simultaneously produced at the outputs of the XOR gates depicted in Figure 6.
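
The three-phase sequence above can be summarized as a per-cycle control table. The following Python sketch is only an illustration of that schedule, not the authors' controller. The names `CycleControl`, `reset_uv`, `mc_select`, `d_index_a`, and `d_index_b` are hypothetical, and the encoding of the MUX control as 0 (deactivated) and 1 (activated) is an assumption. It also assumes that $m$ is odd, so that $(m + 1) / 2$ is an integer, and that the pair of $d$ inputs with index $i$ is applied in clock period $i$.

```python
from typing import NamedTuple


class CycleControl(NamedTuple):
    """Control values for one clock period (all field names are hypothetical)."""
    cycle: int       # clock period i, counted from 1
    reset_uv: bool   # True when the D_u and D_v latches are reset
    mc_select: int   # assumed encoding: 0 = pass external c inputs, 1 = pass fed-back c signals
    d_index_a: int   # index (m + 2i - 3) / 2 of the first d input fed in this period
    d_index_b: int   # index (m - 2i + 1) / 2 of the second d input fed in this period


def control_schedule(m: int) -> list[CycleControl]:
    """Return the control values for clock periods 1 .. (m + 1) / 2."""
    if m < 1 or m % 2 == 0:
        raise ValueError("this sketch assumes a positive odd field size m")
    schedule = []
    for i in range(1, (m + 1) // 2 + 1):
        schedule.append(
            CycleControl(
                cycle=i,
                reset_uv=(i == 1),              # initialization happens in period 1 only
                mc_select=0 if i == 1 else 1,   # M_c is activated from period 2 onward
                d_index_a=(m + 2 * i - 3) // 2,
                d_index_b=(m - 2 * i + 1) // 2,
            )
        )
    return schedule


if __name__ == "__main__":
    for step in control_schedule(5):
        print(step)
```

For $m = 5$, the field size used in the DG figures, the sketch yields three clock periods with $d$ index pairs (2, 2), (3, 1), and (4, 0).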

6. Results and Discussion

In this section, we assess the proposed bit-parallel semi-systolic multiplier alongside notable systolic and semi-systolic multipliers from the literature [23,24,48,51,52,53,58]. The section has two subsections. The first examines the spatial and time complexities of the proposed design and of the competing architectures, which gives insight into their resource usage and speed. The second substantiates the complexity analysis with implementation results, so that the synthesized area, delay, and power can be checked against the trends predicted by the analysis.

6.1. Complexity Analysis

Upon closer examination of the semi-systolic architecture depicted in Figure 6, we can deduce that it consists of a total of $2 m$ PEs. Taken together, these PEs comprise $4 m - 2$ AND gates, $4 m - 2$ XOR gates, $2 m$ MUXs, and $6 m - 2$ storage latches, which carry out the calculations within the array.

To generate the output bits, represented as $t_{j}$, an additional array of m XOR gates is employed in the architecture. These XOR gates combine the corresponding bits of $u_{j - 1}^{(m + 1) / 2}$ and $v_{m - j}^{(m + 1) / 2}$, where j takes values from 1 to m. As a result, the total number of XOR gates required by this architecture amounts to $5 m - 2$.
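
As a quick arithmetic check, the component formulas quoted above can be evaluated for the two field sizes used later in the paper. The sketch below is illustrative only: the function name `component_counts` and the dictionary keys are hypothetical, and it reads the quoted formulas as array-wide totals (the $4 m - 2$ XOR gates inside the PEs plus the $m$ output XOR gates give the stated $5 m - 2$).

```python
def component_counts(m: int) -> dict[str, int]:
    """Array-wide component totals for field size m, using the formulas in the text."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    pe_xor_gates = 4 * m - 2   # XOR gates inside the PEs
    output_xor_gates = m       # extra XOR gates that form the output bits
    return {
        "pes": 2 * m,
        "and": 4 * m - 2,
        "xor": pe_xor_gates + output_xor_gates,  # equals 5m - 2
        "mux": 2 * m,
        "latch": 6 * m - 2,
    }


if __name__ == "__main__":
    for m in (409, 571):
        counts = component_counts(m)
        total = counts["and"] + counts["xor"] + counts["mux"] + counts["latch"]
        print(m, counts, "total gates, MUXs and latches:", total)
```

Summing the four component types gives $17 m - 6$, which grows linearly in $m$ and is consistent with the $O (m)$ spatial complexity claimed for the design.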

To assess the operating speed of the proposed multiplier, we evaluate the Critical Path Delay (CPD), that is, the delay of the longest path in the circuit, which determines the overall delay. In this case, the critical path consists of a 2-to-1 MUX (with delay $T_{M}$) and a 2-input XOR gate (with delay $T_{X}$), so accumulating the propagation delays of these components within the PEs gives a CPD of $T_{M} + T_{X}$.

The proposed multiplier generates its final results within $(m + 1) / 2$ clock cycles. That is, the complete computation, from the start of the multiplication to the generation of the final output bits, finishes within $(m + 1) / 2$ clock cycles. This latency, together with the CPD, determines the overall speed of the multiplier.
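
To make the relation between latency and CPD concrete, the following sketch estimates the total computation time as the number of clock cycles multiplied by the clock period. It is a simplified model, not the authors' timing analysis: it assumes the clock period equals the CPD $T_{M} + T_{X}$ (ignoring latch setup and clock-to-output overhead), it assumes odd $m$, and the function names and the delay values in the example are hypothetical placeholders rather than figures from the paper.

```python
def latency_cycles(m: int) -> int:
    """Number of clock cycles, (m + 1) / 2, needed to produce the product."""
    if m < 1 or m % 2 == 0:
        raise ValueError("this sketch assumes a positive odd field size m")
    return (m + 1) // 2


def total_delay(m: int, t_mux: float, t_xor: float) -> float:
    """Estimated computation time, assuming the clock period equals T_M + T_X."""
    critical_path_delay = t_mux + t_xor
    return latency_cycles(m) * critical_path_delay


if __name__ == "__main__":
    # Placeholder gate delays in arbitrary time units (not taken from the paper).
    example_t_mux, example_t_xor = 2.0, 3.0
    for m in (409, 571):
        print(m, latency_cycles(m), total_delay(m, example_t_mux, example_t_xor))
```

Under these assumptions the delay scales linearly with $m$, which matches the $O (m)$ time complexity reported for all compared designs.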

Table 3 compares the proposed semi-systolic multiplier with several existing parallel systolic/semi-systolic multipliers [23,24,48,51,52,53,58]. The comparison is based on three key aspects: the overall utilization of components (gates, MUXs, and latches), latency, and CPD. The results in Table 3 show that the spatial complexity of the multipliers proposed in [23,24,48,51,52,53,58] is of order $O (m^{2})$, meaning that the number of components grows quadratically with the field size $m$. In contrast, the proposed semi-systolic multiplier has a spatial complexity of $O (m)$, a significant reduction in resource utilization. This reduction has practical implications, especially for RFID assistive devices, where limited resources and area constraints are common. Additionally, Table 3 shows that all the considered designs exhibit a time complexity of $O (m)$. The proposed semi-systolic multiplier therefore achieves comparable time complexity while using significantly fewer resources, a desirable characteristic for many practical applications.

The suggested semi-systolic multiplier layout offers several advantages that make it particularly suitable for RFID assistive devices. Firstly, its space-efficient design leads to reduced area requirements, enabling efficient utilization of available hardware resources. This reduction in space complexity has a direct impact on the Area-Delay Product (ADP) and Power-Delay Product (PDP) of the multiplier, resulting in improved overall performance and energy efficiency.

The advantages of the suggested multiplier arrangement are further supported by the implementation results provided in Table 4. These results validate the claims regarding reduced space complexity and improved ADP and PDP. By reducing resource utilization while maintaining comparable performance, the suggested multiplier layout offers practical benefits for RFID-based assistive devices, where power consumption, area utilization, and overall efficiency are critical considerations.

6.2. Implementation Results

The proposed semi-systolic multiplier and the existing systolic/semi-systolic multipliers [23,24,48,51,52,53,58] were evaluated and compared under a common methodology. All designs were described in VHDL. Before synthesis, each design was functionally verified in ModelSim, using testbenches and simulation runs that checked the multiplier’s output for a variety of input scenarios. Only after functional verification was successfully completed did a design proceed to synthesis. During synthesis, Synopsys Design Compiler translated the VHDL code of each multiplier into a gate-level netlist using the Nangate library (15 nm, 0.8 V), which supplies technology-specific details such as gate dimensions, interconnect delays, and power characteristics. Design Compiler optimized each netlist according to the defined constraints, such as area and power targets. Once synthesis was complete, the area, delay, and power consumption of each design were extracted from the synthesized netlist. These metrics form the basis for comparing the performance characteristics of the multiplier designs.
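
Functional verification of this kind is commonly done by comparing simulated outputs against a software reference model. The Python sketch below shows what such a model could look like; it is not taken from the authors' verification environment. The function names are hypothetical, and the model uses the textbook bit-serial Montgomery recurrence with factor $x^{m}$, which may differ from the Montgomery factor used by the proposed architecture. The example field GF($2^{5}$) with $F(x) = x^{5} + x^{2} + 1$ matches the $m = 5$ size used in the figures but is otherwise an assumption. Polynomials are stored as Python integers, with bit $k$ holding the coefficient of $x^{k}$.

```python
def gf2m_mul(a: int, b: int, f: int, m: int) -> int:
    """Plain product a(x) * b(x) mod f(x) in GF(2^m), by shift-and-add then reduction."""
    product = 0
    for i in range(m):
        if (a >> i) & 1:
            product ^= b << i                 # add b(x) * x^i (addition is XOR)
    for degree in range(2 * m - 2, m - 1, -1):
        if (product >> degree) & 1:
            product ^= f << (degree - m)      # cancel the term of this degree
    return product


def gf2m_montgomery_mul(a: int, b: int, f: int, m: int) -> int:
    """Montgomery product a(x) * b(x) * x^(-m) mod f(x); f must have constant term 1."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b          # accumulate a_i * b(x)
        if c & 1:
            c ^= f          # adding f(x) clears the constant term
        c >>= 1             # exact division by x
    return c


if __name__ == "__main__":
    m = 5
    f = 0b100101            # x^5 + x^2 + 1, irreducible over GF(2)
    r = f ^ (1 << m)        # x^m mod f(x), the Montgomery factor in this model
    # Multiplying the Montgomery product by x^m must give back the plain product.
    for a in range(1 << m):
        for b in range(1 << m):
            mont = gf2m_montgomery_mul(a, b, f, m)
            assert gf2m_mul(mont, r, f, m) == gf2m_mul(a, b, f, m)
    print("reference model is self-consistent for m =", m)
```

A testbench could feed the same operand pairs to the hardware simulation and to a model like this one and compare the results bit by bit, after adjusting the model to the Montgomery factor that the hardware actually implements.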

Table 4 presents the synthesis results for the proposed semi-systolic multiplier architecture in comparison with the existing designs [23,24,48,51,52,53,58] for field sizes $m = 409$ and $m = 571$. The key performance metrics (area, delay, power consumption, ADP, and PDP) were derived from the synthesis results of each implementation. This analysis highlights the strengths and weaknesses of each design and enables a direct evaluation of the performance differences among them.

By inspecting the eighth and ninth columns of Table 4, it is evident that the proposed semi-systolic multiplier achieves significant reductions in both area and power consumption compared with the existing designs. The area reduction ranges from 99.4% to 99.7% for both field sizes, a substantial decrease in required hardware resources that is critical in applications where space is limited. Similarly, the power consumption reduction ranges from 91.6% to 95.9% for $m = 409$ and from 92.3% to 95.9% for $m = 571$, an improvement in energy efficiency that is essential for prolonging the operational lifespan of battery-powered devices. These results show that the proposed bit-parallel semi-systolic multiplier is well suited for deployment in resource-constrained RFID sensor tags designed for individuals with disabilities, where minimizing hardware resources and power consumption is of utmost importance.

Upon analyzing the fourth column of Table 4, it is worth noting that the proposed design exhibits slightly higher delay than some of the existing designs. This is primarily due to the slightly larger CPD of the proposed layout, since the CPD is the longest delay path in the multiplier circuit and affects the overall performance. Despite this slight increase in delay, the proposed design retains the same $O (m)$ time complexity as the competing designs, making it suitable for RFID tags, where timely computations are essential for user experience. In assistive devices that rely on real-time data processing, even minor delays can adversely affect functionality and satisfaction. Moreover, the design’s efficiency in handling multiplication operations supports the ECC algorithm it serves. Ultimately, the combination of area and power savings with a comparable delay underscores the viability of this design for RFID assistive systems that require real-time processing.

In examining the last two columns of Table 4, it is important to emphasize that the proposed semi-systolic multiplier significantly outperforms existing designs in terms of ADP and PDP, achieving ADP reductions between 99.3% and 99.9% for both field sizes. This improvement matters for RFID tags designed for individuals with disabilities, as lower ADP values facilitate compact integration into assistive devices without sacrificing performance. Moreover, the reductions in PDP, ranging from 90.9% to 98.5% for $m = 409$ and from 91.4% to 98.6% for $m = 571$, indicate significant improvements in energy efficiency, which are crucial for battery-operated devices. Extended battery life can directly improve usability and reliability, allowing users to depend on these technologies throughout their daily activities. The design’s ability to maintain high performance while minimizing area and power usage aligns with the real-time data processing needs of assistive technologies. Improved energy management not only enhances user experience but also supports the development of sustainable solutions, ensuring that RFID systems provide critical support without imposing additional burdens on users. Ultimately, these results underscore the design’s potential to significantly enhance the performance and efficiency of RFID tags, making them more accessible and effective for individuals with disabilities.
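
The ADP and PDP figures discussed above are products of the synthesized area or power with the delay. The sketch below shows how such metrics and percentage reductions are conventionally computed. The reduction formula is assumed to be the usual relative difference with respect to the baseline design, the function names are hypothetical, and the numbers in the example are invented placeholders, not values from Table 4.

```python
def area_delay_product(area: float, delay: float) -> float:
    """ADP: area multiplied by delay (units follow the inputs)."""
    return area * delay


def power_delay_product(power: float, delay: float) -> float:
    """PDP: power multiplied by delay (units follow the inputs)."""
    return power * delay


def percent_reduction(baseline: float, proposed: float) -> float:
    """Relative saving of the proposed value with respect to the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


if __name__ == "__main__":
    # Invented placeholder figures: a large, fast baseline and a small, slightly slower design.
    baseline_adp = area_delay_product(area=1000.0, delay=10.0)
    proposed_adp = area_delay_product(area=5.0, delay=12.0)
    print("ADP reduction (%):", percent_reduction(baseline_adp, proposed_adp))
```

The placeholder example illustrates the trade-off described in the text: a design with a somewhat longer delay can still achieve a much lower ADP when its area is far smaller than the baseline.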

Considering the aforementioned design features, the proposed bit-parallel semi-systolic multiplier architecture presents significant advantages in terms of resource efficiency and energy conservation, making it particularly suitable for deployment in RFID assistive devices with limited computational resources. Optimizing energy consumption in such devices is crucial, as it can lead to extended operational lifetimes, reduced maintenance costs, and enhanced overall user satisfaction. Furthermore, the architecture’s capacity to deliver robust performance within a compact form factor facilitates seamless integration into assistive technologies, ultimately improving accessibility and enriching the user experience. By balancing high performance with minimal resource consumption, this design addresses the challenges associated with real-time data processing while promoting the development of sustainable technological solutions. Additionally, the architecture significantly eases the implementation of complex ECC algorithms on RFID tags, since field multiplication is central to ECC’s operation. ECC has been recognized for its effectiveness in safeguarding sensitive data within assistive technologies, with empirical studies demonstrating that its adoption can substantially improve security levels, thereby ensuring greater reliability and trustworthiness for users.

In conclusion, the findings affirm the architecture’s potential as a transformative solution for RFID assistive systems, markedly enhancing security for disabled users who depend on these essential technologies. This architecture not only meets operational requirements but also aligns with broader objectives aimed at promoting secure accessibility for individuals with disabilities. Recent evaluations indicate that enhancing security features in assistive devices not only protects sensitive information but also bolsters user confidence, ultimately empowering users with disabilities and improving their overall quality of life.

7. Summary and Conclusions

This research work focused on enhancing cryptographic protocols in low-cost RFID assistive devices. To achieve this goal, we considered the implementation of the essential operation of these protocols, which is finite field multiplication. We developed a novel and highly efficient one-dimensional bit-parallel semi-systolic array layout for polynomial-basis Montgomery multiplication in GF($2^{m}$). The approach adopted in this work starts from a conventional iterative algorithm represented by a DG. By assigning suitable scheduling and node projection functions to each node in the DG, a practical construction of a bit-parallel semi-systolic multiplier is achieved. This design enables efficient and high-speed multiplication operations in GF($2^{m}$) using the Montgomery multiplication technique. The key advantage of the proposed one-dimensional parallel arrangement is its significantly reduced spatial complexity. Unlike previous parallel structures with a spatial complexity of $O (m^{2})$, the proposed layout exhibits a spatial complexity of $O (m)$. This represents a substantial improvement in resource utilization, making it highly suitable for VLSI implementation. The complexity analysis demonstrated that the proposed multiplier has a considerably smaller total area, further validating its efficiency and suitability for implementation. To evaluate the effectiveness of the proposed multiplier, both the proposed design and previously presented multiplier layouts were synthesized using an ASIC CMOS library. The synthesis results confirmed the substantial area and power reductions achieved by the proposed multiplier architecture. Additionally, important metrics such as PDP and ADP showed significant savings, reinforcing the efficiency gains of the proposed design. Based on these findings, the proposed multiplier significantly eases the implementation of complex ECC algorithms on low-cost RFID tags employed in assistive systems, ensuring greater reliability and trustworthiness for users.

Author Contributions

Conceptualization, A.I.; methodology, A.I. and F.G.; software, A.I.; validation, A.I.; formal analysis, A.I.; investigation, A.I.; resources, A.I.; data curation, A.I.; writing—original draft preparation, A.I.; writing—review and editing, A.I. and F.G.; visualization, A.I.; supervision, A.I.; project administration, A.I. and F.G.; funding acquisition, A.I. All authors have read and agreed to the published version of the manuscript.

Funding

King Salman Center for Disability Research, project number KSRG-2024-207.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors extend their appreciation to the King Salman Center for Disability Research for funding this work through Research Group no. KSRG-2024-207.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

IoT	Internet of Things
RFID	Radio Frequency Identification
ADP	Area-Delay Product
PDP	Power-Delay Product
ASIC	Application Specific Integrated Circuit
RSA	Rivest, Shamir, and Adleman
ECC	Elliptic Curve Cryptography
DoS	Denial of Service
MitM	Man-in-the-Middle
DG	Dependency Graph
CPD	Critical Path Delay

References

1. Semary, H.; Al-Karawi, K.A.; Abdelwahab, M.M.; Elshabrawy, A. A Review on Internet of Things (IoT)-Related Disabilities and Their Implications. J. Disabil. Res. 2024, 3, 20240012.
2. Giannakas, F.; Troussas, C.; Krouska, A.; Voyiatzis, I.; Sgouropoulou, C. Blending cybersecurity education with IoT devices: A u-Learning scenario for introducing the man-in-the-middle attack. Inf. Secur. J. A Glob. Perspect. 2023, 32, 371–382.
3. Ahmad Awan, K.; Ud Din, I.; Al-Huqail, A.A.; Almogren, A. SecuTwin for All: Enhancing Disability-focused Healthcare Through Secure Digital Twin Technology and Connected Health Monitoring. J. Disabil. Res. 2024, 3, 20240093.
4. Lee, T.F.; Lin, K.W.; Hsieh, Y.P.; Lee, K.C. Lightweight cloud computing-based RFID authentication protocols using PUF for e-healthcare systems. IEEE Sens. J. 2023, 23, 6338–6349.
5. Das, S.; Namasudra, S.; Deb, S.; Ger, P.M.; Crespo, R.G. Securing iot-based smart healthcare systems by using advanced lightweight privacy-preserving authentication scheme. IEEE Internet Things J. 2023, 10, 18486–18494.
6. He, D.; Zeadally, S. An analysis of RFID authentication schemes for internet of things in healthcare environment using elliptic curve cryptography. IEEE Internet Things J. 2014, 2, 72–83.
7. Fan, K.; Jiang, W.; Li, H.; Yang, Y. Lightweight RFID protocol for medical privacy protection in IoT. IEEE Trans. Ind. Inform. 2018, 14, 1656–1665.
8. Qiu, S.; Xu, G.; Ahmad, H.; Wang, L. A robust mutual authentication scheme based on elliptic curve cryptography for telecare medical information systems. IEEE Access 2017, 6, 7452–7463.
9. Periša, M.; Teskera, P.; Cvitić, I.; Grgurević, I. Empowering People with Disabilities in Smart Homes Using Predictive Informing. Sensors 2025, 25, 284.
10. Vrančić, A.; Zadravec, H.; Orehovački, T. The role of smart homes in providing care for older adults: A systematic literature review from 2010 to 2023. Smart Cities 2024, 7, 1502–1550.
11. Fizza, K.; Jayaraman, P.P.; Banerjee, A.; Auluck, N.; Ranjan, R. IoT-QWatch: A novel framework to support the development of quality-aware autonomic IoT applications. IEEE Internet Things J. 2023, 10, 17666–17679.
12. Khadka, G.; Ray, B.; Karmakar, N.C.; Choi, J. Physical-layer detection and security of printed chipless RFID tag for internet of things applications. IEEE Internet Things J. 2022, 9, 15714–15724.
13. Ghosh, H.; Maurya, P.K.; Bagchi, S. Secret sharing based RFID protocol using ECC for TMIS. Peer-to-Peer Netw. Appl. 2024, 17, 624–638.
14. Cetintav, I.; Sandikkaya, M.T. A Review of Lightweight IoT Authentication Protocols from the Perspective of Security Requirements, Computation, Communication, and Hardware Costs. IEEE Access 2025, 13, 37703–37723.
15. Kumar, S.; Banka, H.; Kaushik, B. Lightweight group authentication protocol for secure RFID system. In Multimedia Tools and Applications; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–29.
16. Vijaykumar, V.R.; Sekar, S.R.; Jothin, R.; Diniesh, V.C.; Elango, S.; Ramakrishnan, S. Novel Light Weight Hardware Authentication Protocol for Resource Constrained IOT Based Devices. IEEE J. Radio Freq. Identif. 2024, 8, 31–42.
17. Shihab, S.; AlTawy, R. Lightweight authentication scheme for healthcare with robustness to desynchronization attacks. IEEE Internet Things J. 2023, 10, 18140–18153.
18. Wang, Y.; Liu, R.; Gao, T.; Shu, F.; Lei, X.; Wu, Y.; Gui, G.; Wang, J. A novel RFID authentication protocol based on a block-order-modulus variable matrix encryption algorithm. arXiv 2024, arXiv:2312.10593.
19. Dobraunig, C.; Eichlseder, M.; Mendel, F.; Schläffer, M. Ascon v1.2: Lightweight authenticated encryption and hashing. J. Cryptol. 2021, 34, 33.
20. Banik, S.; Chakraborti, A.; Inoue, A.; Iwata, T.; Minematsu, K.; Nandi, M.; Peyrin, T.; Sasaki, Y.; Sim, S.M.; Todo, Y. GIFT-COFB. Cryptology ePrint Archive 2020.
21. Chen, C.C.; Lee, C.Y.; Lu, E.H. Scalable and Systolic Montgomery Multipliers Over GF(2^m). IEICE Trans. Fundam. 2008, E91-A, 1763–1771.
22. Chiou, C.W.; Lee, C.Y.; Deng, A.W.; Lin, J.M. Concurrent error detection in Montgomery multiplication over GF(2^m). IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2006, 89, 566–574.
23. Huang, W.T.; Chang, C.; Chiou, C.; Chou, F. Concurrent error detection and correction in a polynomial basis multiplier over GF(2^m). IET Inf. Secur. 2010, 4, 111–124.
24. Kim, K.W.; Jeon, J.C. Polynomial Basis Multiplier Using Cellular Systolic Architecture. IETE J. Res. 2014, 60, 194–199.
25. Choi, S.; Lee, K. Efficient systolic modular multiplier/squarer for fast exponentiation over GF(2^m). IEICE Electron. Express 2015, 12, 20150222.
26. Reyhani-Masoleh, A. A new bit-serial architecture for field multiplication using polynomial bases. In Proceedings of the 7th International Workshop Cryptographic Hardware Embedded Systems (CHES 2008), Washington, DC, USA, 10–13 August 2008; pp. 300–314.
27. Abdulrahman, E.A.H.; Reyhani-Masoleh, A. High-Speed Hybrid-Double Multiplication Architectures Using New Serial-Out Bit-Level Mastrovito Multipliers. IEEE Trans. Comput. 2016, 65, 1734–1747.
28. Kim, K.W.; Jeon, J.C. A semi-systolic Montgomery multiplier over GF(2^m). IEICE Electron. Express 2015, 12, 20150769.
29. Ibrahim, A. Novel Bit-Serial Semi-Systolic Array Structure for Simultaneously Computing Field Multiplication and Squaring. IEICE Electron. Express 2019, 16, 20190600.
30. Kim, K.W.; Lee, J.D. Efficient unified semi-systolic arrays for multiplication and squaring over GF(2^m). Electron. Express 2017, 14, 20170458.
31. Kim, K.W.; Kim, S.H. Efficient bit-parallel systolic architecture for multiplication and squaring over GF(2^m). IEICE Electron. Express 2018, 15, 1–6.
32. Ibrahim, A. Efficient Parallel and Serial Systolic Structures for Multiplication and Squaring Over GF(2^m). Can. J. Electr. Comput. Eng. 2019, 42, 114–120.
33. Roman, S. Field Theory, 2nd ed.; Springer: New York, NY, USA, 1983.
34. Pillutla, S.R.; Boppana, L. Area-efficient low-latency polynomial basis finite field GF(2^m) systolic multiplier for a class of trinomials. Microelectron. J. 2020, 97, 104709.
35. Imana, J.L. LFSR-Based Bit-Serial GF(2^m) Multipliers Using Irreducible Trinomials. IEEE Trans. Comput. 2020, 70, 156–162.
36. Pillutla, S.R.; Boppana, L. Low-latency area-efficient systolic bit-parallel GF(2^m) multiplier for a narrow class of trinomials. Microelectron. J. 2021, 117, 105275.
37. Li, Y.; Cui, X.; Zhang, Y. An Efficient CRT-based Bit-parallel Multiplier for Special Pentanomials. IEEE Trans. Comput. 2021, 71, 736–742.
38. Li, Y.; Zhang, Y.; He, W. Fast hybrid Karatsuba multiplier for type II pentanomials. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 2459–2463.
39. Meher, P.K.; Lou, X. Low-Latency, Low-Area, and Scalable Systolic-Like Modular Multipliers for GF(2^m) Based on Irreducible All-One Polynomials. IEEE Trans. Circuits Syst. I Regul. Pap. 2016, 64, 399–408.
40. Mohaghegh, S.; Yemiscoglu, G.; Muhtaroglu, A. Low-Power and Area-Efficient Finite Field Multiplier Architecture Based on Irreducible All-One Polynomials. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Sevilla, Spain, 10–21 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–5.
41. Zhang, Y.; Li, Y. Efficient Hybrid GF(2^m) Multiplier for All-One Polynomial Using Varied Karatsuba Algorithm. IEICE Trans. Fundam. Electron. Comput. Sci. 2021, 104, 636–639.
42. Zhou, B.B. A New Bit Serial Systolic Multiplier over GF(2^m). IEEE Trans. Comput. 1988, 37, 749–751.
43. Fenn, S.T.J.; Taylor, D.; Benaissa, M. A Dual Basis Bit Serial Systolic Multiplier for GF(2^m). Integr. VLSI J. 1995, 18, 139–149.
44. Lee, C.Y.; Lu, E.H.; Lee, J.Y. Bit-Parallel Systolic Multipliers for GF(2^m) Fields Defined by All-One and Equally-Spaced Polynomials. IEEE Trans. Comput. 2001, 50, 358–393.
45. Lee, C.Y.; Lu, E.H.; Sun, L.F. Low-Complexity Bit-Parallel Systolic Architecture for Computing AB²+C in a Class of Finite Field GF(2^m). IEEE Trans. Circuits Syst. II 2001, 50, 519–523.
46. Lee, C.Y.; Chiou, C.W. Efficient Design of Low-Complexity Bit-Parallel Systolic Hankel Multipliers to Implement Multiplication in Normal and Dual Bases of GF(2^m). IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2005, E88-A, 3169–3179.
47. Lee, C.Y. Low-latency bit-parallel systolic multiplier for irreducible x^m+x^n+1 with GCD(m, n) = 1. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2008, 55, 828–837.
48. Bayat-Sarmadi, S.; Farmani, M. High-Throughput Low-Complexity Systolic Montgomery Multiplication Over GF(2^m) Based on Trinomials. IEEE Trans. Circ. Sys.-II 2015, 62, 377–381.
49. Mathe, S.E.; Boppana, L. Bit-parallel systolic multiplier over GF(2^m) for irreducible trinomials with ASIC and FPGA implementations. IET Circuits Devices Syst. 2018, 12, 315–325.
50. Lee, C.Y.; Chiou, C.W.; Lin, J.M. Concurrent error detection in a polynomial basis multiplier over GF(2^m). J. Electron. Test. 2006, 22, 143–150.
51. Chiou, C.W.; Lee, C.M.; Sun, Y.S.; Lee, C.Y.; Lin, J.M. High-throughput Dickson basis multiplier with a trinomial for lightweight cryptosystems. IET Comput. Digit. Tech. 2018, 12, 187–191.
52. Lee, K. Resource and Delay Efficient Polynomial Multiplier over Finite Fields GF(2^m). J. Korea Soc. Digit. Ind. Inf. Manag. 2020, 16, 1–9.
53. Lee, K. Low Complexity Systolic Montgomery Multiplication over Finite Fields GF(2^m). J. Korea Soc. Digit. Ind. Inf. Manag. 2022, 18, 1–9.
54. Mathe, S.E.; Boppana, L. Design and Implementation of a Sequential Polynomial Basis Multiplier over GF(2^m). KSII Trans. Internet Inf. Syst. 2017, 11, 2680–2700.
55. Gebali, F. Algorithms and Parallel Computers; John Wiley: New York, NY, USA, 2011.
56. Ibrahim, A.; Gebali, F. Scalable and Unified Digit-Serial Processor Array Architecture for Multiplication and Inversion over GF(2^m). IEEE Trans. Circuits Syst. I Regul. Pap. 2017, 22, 2894–2906.
57. Ibrahim, A.; Alsomani, T.; Gebali, F. New Systolic Array Architecture for Finite Field Inversion. IEEE Can. J. Electr. Comput. Eng. 2017, 40, 23–30.
58. Chiou, C.W.; Lin, J.M.; Lee, C.Y.; Ma, C.T. Novel Mastrovito Multiplier over GF(2^m) Using Trinomial. In Proceedings of the 2011 5th International Conference on Genetic and Evolutionary Computing (ICGEC), Kitakyushu, Japan, 29 August–1 September 2011; pp. 237–242.
59. Ibrahim, A.; Gebali, F.; Bouteraa, Y.; Tariq, U.; Ahanger, T.; Alnowaiser, K. Compact Bit-Parallel Systolic Multiplier Over GF(2^m). IEEE Can. J. Electr. Comput. Eng. 2021, 44, 199–205.

Figure 1. RFID assistive system.

Figure 2. DG of computing variable U for $m = 5$.

Figure 3. DG of computing variable V for $m = 5$.

Figure 4. Node timing for U using $m = 5$.

Figure 5. Node timing for V using $m = 5$.

Figure 6. Semi-systolic bit-parallel multiplier structure.