Article

High-Level Synthesis-Based FPGA Hardware Accelerator for Generalized Hebbian Learning Algorithm for Neuromorphic Computing

by Shivani Sharma and Darshika G. Perera *
Department of Electrical and Computer Engineering, University of Colorado Colorado Springs, 1420 Austin Bluffs Parkway, Colorado Springs, CO 80918, USA
* Author to whom correspondence should be addressed.
Electronics 2026, 15(8), 1725; https://doi.org/10.3390/electronics15081725
Submission received: 7 March 2026 / Revised: 6 April 2026 / Accepted: 15 April 2026 / Published: 18 April 2026

Abstract

With the advent of AI and the smart systems era, neuromorphic computing will be imperative to support next-generation AI-related applications. Existing intelligent systems (such as smart cities and robotics) face many challenges and requirements, including high performance, adaptability, scalability, dynamic decision-making, and low power. Neuromorphic computing is emerging as a complementary solution to address these challenges and requirements of next-gen intelligent systems. Neuromorphic computing offers many traits, such as adaptive, low-power, scalable, and parallel computing, that satisfy the requirements of future intelligent systems. However, several challenges hinder the advancement of neuromorphic computing, so innovative solutions (in terms of models, architectures, and techniques) are needed for neuromorphic computing to support next-gen intelligent systems. In this research work, we introduce a novel and efficient FPGA-HLS-based hardware accelerator for the Generalized Hebbian learning algorithm (GHA) for neuromorphic computing applications. We decided to focus on GHA, since it was demonstrated that GHA enables online and incremental learning, and provides a hardware-efficient unsupervised learning framework that aligns closely with the principles of biological adaptation; these traits are vital for neuromorphic computing applications. In addition, our previous work showed that FPGAs have many features, such as low power, customized circuits, parallel computing capabilities, low latency, and especially adaptive nature, which make FPGAs suitable for neuromorphic computing applications. We propose two different hardware versions of FPGA-HLS-based GHA hardware accelerators: one is memory-mapped interface-based and the other is streaming interface-based. Our streaming interface-based FPGA-HLS-based GHA hardware IP achieves up to 51.13× speedup compared to its embedded software counterpart, while maintaining the small area and low power requirements of neuromorphic computing applications. Our experimental results show great potential in utilizing FPGA-based architectures to support neuromorphic computing applications.

1. Introduction

With the advancement of Artificial Intelligence (AI), neuromorphic computing research and neuromorphic hardware architectures will be imperative to support next-generation AI-related applications. This is not only because neuromorphic computing is inspired by the structure and function of the human brain, but also due to many potential traits of neuromorphic computing hardware architectures, which include dense connectivity, parallel operations, low energy use, tight coupling of memory and processing, etc. [1,2]. With these traits, neuromorphic computing can be utilized to create intelligent systems that mimic the brain’s functionality, while addressing various limitations of next-gen AI-related applications. This in turn would create next-gen intelligent systems capable of perception, learning, smart and autonomous decision-making, merging biological principles with modern computing [1,2].
In recent years, there has been a growing interest in neuromorphic computing research and development in industry and academia. Major companies such as IBM and Intel introduced neuromorphic computing systems/platforms, including the TrueNorth processor [3] and Loihi [4], respectively. Neuromorphic computing concepts and designs are being explored and incorporated into various fields/applications such as robotics, AI, edge computing, healthcare, cybersecurity, smart cities, and aerospace, to name a few [5,6,7,8,9,10]. Also, according to the Allied Market Research report [11], the global neuromorphic computing market was valued at US$26.32 million in 2020, and is projected to reach US$8583.98 million by 2030, with a CAGR (compound annual growth rate) of 79%. The above facts illustrate that the neuromorphic computing market and research will continue to flourish over the long term, as new applications emerge, including next-gen AI.
Existing intelligent systems (e.g., robotics, AI, smart cities) face many challenges and requirements [5,6,7,8,9,10]. For instance [12,13,14], autonomous robots must interpret complex sensory data while navigating dynamic environments; edge devices must process data locally with limited resources, such as bandwidth, power, and area; wearable health monitors require exceptional precision; and smart city applications, such as traffic and energy management, require processing data from many sources with high efficiency and low cost. Most of these existing systems share common challenges [15,16], including high processing power, adaptability to varying environments, reliability, scalability, fast processing and analysis of data, and accurate on-the-fly decision-making. Consequently, novel solutions are needed to address the challenges and satisfy the requirements of future intelligent systems.
Neuromorphic computing is emerging as a complementary solution to address the above challenges and requirements of next-gen intelligent systems. The brain-inspired architectures of neuromorphic computing systems enable adaptive, low-power, and event-driven computations suitable for real-time processing [17,18,19,20]. Neuromorphic computing systems can perform parallel processing for both learning and inference efficiently, by mimicking the brain’s network of interconnected neurons and synapses [21]. Furthermore, neuromorphic hardware architectures can be created in such a way to be adaptive, scalable, energy-efficient, and area-efficient (or small footprint) [18,21]. The aforementioned traits of neuromorphic computing would be able to satisfy the requirements of future intelligent systems and applications.
Although neuromorphic computing has many vital traits that would be beneficial for next-gen intelligent systems, there are several issues impeding the advancement of neuromorphic computing, including lack of robust development tools, limited programming/design framework, and hardware architectures that are limited to specific applications [22,23]. Hence, neuromorphic computing requires novel models, architectures, and techniques to support the requirements and challenges of future intelligent systems, including next-gen AI.
There are many models and algorithms, inspired by the brain’s function, that are utilized to create neuromorphic computing hardware architectures [1]. Some of the most prominent ones include [24,25,26,27]: spike-timing dependent plasticity (STDP), backpropagation through time (BPTT), spiking neural networks (SNNs), Generalized Hebbian learning algorithm (GHA), etc. From these, the Generalized Hebbian learning algorithm (GHA), often referred to as Sanger’s Rule, is an extension of Hebbian learning algorithm (HA), which is one of the most biologically realistic and ecologically valid learning processes [28,29]. As stated in [28], GHA enables online and incremental learning by computing principal components from high-dimensional input data through local weight updates. Furthermore, GHA provides a hardware-efficient unsupervised learning framework that aligns closely with the principles of biological adaptation [26]. As stated in [30], for GHA, the brain-like processing is reflected by locally updating each weight, online activity-driven learning, and using connected input and output neurons, while being updated in parallel. Therefore, in this research work, we decided to focus on the Generalized Hebbian learning algorithm (GHA) for neuromorphic computing platforms/applications.
Neuromorphic computing-related algorithms and models can be realized on various computing platforms, including central processing units (CPUs), graphics processing units (GPUs), application-specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs) [31,32]. These platforms have their own advantages and disadvantages. For instance, CPUs provide flexibility and ease of programming but have limitations in parallelism and processing of a large volume of data, which impedes scalability and processing of extensive workloads [31,33]. GPUs provide parallel computing and dense neural simulation, but consume substantial power, which is not suitable for highly constrained neuromorphic computing platforms/applications [32,33]. FPGAs provide on-chip reconfigurability, low power consumption, and highly parallel computing, which support real-time processing and analysis of neuromorphic computing algorithms, such as GHA and SNNs [34].
From our previous work [35,36], it was revealed that FPGAs would be one of the best avenues to support and accelerate algorithms and models for neuromorphic computing platforms/applications, considering the associated constraints and requirements. FPGAs are being increasingly utilized to support and accelerate complex (i.e., compute/data-intensive) applications on resource-constrained devices, such as embedded systems [37,38]. As stated in [39], FPGAs provide fine-grained parallelism, low power consumption, and customized circuits for specific algorithms, resulting in high throughput. FPGAs’ reconfigurable nature facilitates the adaptive traits required for neuromorphic computing [40]. FPGAs’ support for a plug-and-play method for integrating custom IPs is important for adaptive learning in energy-constrained applications such as robotics, edge AI, etc. [41,42].
Considering the aforementioned, in this paper, our intention is to introduce a novel and efficient high-level synthesis (HLS) based FPGA hardware accelerator for the GHA for neuromorphic computing. From our investigation on existing works (in Section 2.3) and to the best of our knowledge, we did not find any FPGA-HLS-based hardware architectures for the GHA in the published literature, demonstrating the uniqueness of our contribution.
In this paper, we make the following contributions:
  • We introduce a novel, unique, and efficient FPGA-HLS-based hardware accelerator for the GHA for neuromorphic computing. Our proposed GHA hardware architecture is created in such a way to minimize the occupied on-chip hardware resources and maximize the speedup. Our GHA hardware accelerator is also flexible and scalable, thus can be utilized to process varying datasets with varying sizes, and can be efficiently deployed on any FPGA platform, leveraging HLS capabilities. These traits are imperative and beneficial for neuromorphic computing applications with many constraints, including limited hardware resources and energy constraints.
  • We introduce two different hardware versions for our FPGA-HLS-based architectures for the GHA: one with a memory-mapped interface, and another with a streaming interface. We compare these two hardware versions in terms of various performance metrics. Furthermore, we introduce embedded software architecture for the GHA to evaluate our proposed embedded hardware architectures.
  • We perform experiments to evaluate the feasibility and efficiency of both versions of our proposed FPGA-HLS-based GHA hardware accelerators using five different benchmark datasets. We analyze the timing, speedup, classification accuracy, error rate, convergence rate, hardware resource utilization, and power for our embedded hardware designs.
The organization of this paper is as follows: In Section 2, we discuss and present a brief description of neuromorphic computing, the GHA, as well as the analysis of the existing works on hardware architectures for GHA in the published literature. In Section 3, we present our design approach and development platform, including experimental platform, system-level architectures, and benchmark datasets utilized. We also present a brief description of our embedded GHA software architecture in Section 3. In Section 4, our proposed FPGA-HLS-based GHA hardware architectures for both memory-mapped interface-based and streaming interface-based designs are introduced. Our experimental results and analysis in terms of timing, speedup, area, accuracy, error rate, convergence rate, and power are reported and discussed in Section 5. In Section 6, we summarize our work, conclude, and discuss our future research directions.

2. Background and Existing Works

In this section, we present brief descriptions on neuromorphic computing and the Generalized Hebbian learning algorithm (GHA). We also discuss and present existing research works on hardware designs and implementations for GHA in the published literature.

2.1. Neuromorphic Computing

Neuromorphic computing as a concept was first introduced by Carver Mead in 1990 [43]. Neuromorphic computing, inspired by the structure and function of the human brain, strives to create energy-efficient hardware systems capable of performing highly complex information processing tasks [44]. Also, neuromorphic computing mimics the structure and the logical processing of “biological neural systems”, providing efficient hardware architectures for artificial intelligence [20].
Based on a recent market analysis in [45], the global neuromorphic computing chip market will reach approximately US$556.6 million by 2026. Furthermore, as stated in [46,47], early-stage commercial neuromorphic chips have already shown the ability to perform complex computational tasks at varying scales with extremely low power consumption and low latency. The aforementioned facts demonstrate that neuromorphic computing has tremendous potential for many embedded systems-related applications.
Compared to conventional computing systems, neuromorphic computing systems provide several unique advantages [16,21]: (1) tightly coupled memory and computing, which reduces data transfer or memory access latency; (2) sparse and distributed information encoding; (3) dynamic and local learning techniques, which reduce power consumption by avoiding global backpropagation across deep network layers; (4) predictive perception, which enables reaching stable interpretations; (5) multi-timescale dynamics, which allow adaptive real-time learning and signal processing.
Neuromorphic computing focuses on designing hardware that replicates the operational principles of the human brain. An illustration of a biological neuron, revised from [48,49], is depicted in Figure 1. As stated in [48,49], in biological neurons, electrochemical signals enter through small connection points known as synapses. As shown in Figure 1, these synapses are distributed across the surfaces of branch-like structures called dendrites, which extend into the surrounding neural tissue to collect signals from their synapses and transmit these signals to the neuron’s central region (i.e., the cell body) [48,49]. From the cell body, another elongated fiber known as the axon carries outgoing signals into the neural network, ending at synapses on the dendrites of other neurons [48,49]. Biological neurons operate using a threshold principle [50,51]. When a signal arrives from a preceding neuron, it only affects the receiving neuron if it exceeds a certain level; otherwise, the neuron’s membrane potential remains unchanged [50]. This process helps conserve energy while filtering out irrelevant or insignificant signals [51]. Neuromorphic computing systems are created to emulate the aforementioned biological components and their functions.
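The threshold principle described above can be illustrated with a few lines of C. This is only a minimal sketch and not a model taken from the cited works: the function name `threshold_neuron_step`, the choice to simply add a supra-threshold input to the membrane potential, and the numeric values in the usage note are all assumptions made for illustration.

```c
/* Hypothetical illustration of the threshold principle: an incoming signal
 * changes the membrane potential only if it exceeds the threshold. */
float threshold_neuron_step(float membrane, float input, float threshold)
{
    if (input > threshold) {
        return membrane + input;  /* supra-threshold input is integrated */
    }
    return membrane;              /* sub-threshold input leaves the state unchanged */
}
```

For example, with a threshold of 0.5, an input of 0.2 leaves a membrane potential of 1.0 unchanged, whereas an input of 0.75 raises it to 1.75.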

2.2. Generalized Hebbian Learning Algorithm

The Generalized Hebbian learning algorithm (GHA) is an iterative technique that is used to extract the principal component (PC) vectors from the input data [28]. The GHA operates in such a way that closely emulates the process of unsupervised Hebbian learning, typically observed in biological and artificial neural systems [28,52].
Below, we summarize and present the GHA with the corresponding equations; a detailed description of the GHA can be found in [28].
Let us consider a single-layer neural network consisting of $n$ input neurons and $m$ output neurons denoted by $y_1, \ldots, y_m$, where the connection strength between the $j$th input neuron and the $i$th output neuron is represented by the synaptic weight $\omega_{ij}$. These synaptic weights form linear code vectors that define how information is transferred and transformed across the network [53].
Then, we can write the GHA in component form as follows [28]:
$$\Delta\omega_{ij} = \eta\, y_i \left( x_j - \sum_{k=1}^{i} \omega_{kj}\, y_k \right) \tag{1}$$
where η is the learning rate parameter.
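The update in Equation (1) can be sketched in plain C as follows. This is a minimal illustration under stated assumptions, not the authors' HLS implementation: the function name `gha_update`, the fixed sizes `N_IN` and `N_OUT`, and the row-major weight layout `w[i][j]` (output i, input j) are hypothetical, and the outputs are assumed to be computed as y_i = Σ_j ω_ij x_j before any weight is modified.

```c
#define N_IN  2   /* n: number of input neurons (illustrative size) */
#define N_OUT 2   /* m: number of output neurons (illustrative size) */

/* One GHA step for a single input sample x.
 * w[i][j] is the weight from input j to output i; eta is the learning rate. */
void gha_update(float w[N_OUT][N_IN], const float x[N_IN], float eta)
{
    float y[N_OUT];

    /* Forward pass: y_i = sum_j w_ij * x_j, using the current weights. */
    for (int i = 0; i < N_OUT; ++i) {
        y[i] = 0.0f;
        for (int j = 0; j < N_IN; ++j) {
            y[i] += w[i][j] * x[j];
        }
    }

    /* Equation (1): dw_ij = eta * y_i * (x_j - sum_{k<=i} w_kj * y_k).
     * For each column j the running sum is accumulated from the old weights:
     * w[i][j] is added to the sum before it is overwritten. */
    for (int j = 0; j < N_IN; ++j) {
        float recon = 0.0f;  /* sum_{k=1..i} w_kj * y_k */
        for (int i = 0; i < N_OUT; ++i) {
            recon += w[i][j] * y[i];
            w[i][j] += eta * y[i] * (x[j] - recon);
        }
    }
}
```

As a hand-checkable case, starting from the identity weights {{1, 0}, {0, 1}} with x = (1, 2) and η = 0.5, a single call changes only the weight `w[0][1]`, which becomes 1; the other three weights keep their values because their correction terms are zero.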
The above Equation (1) is derived from Equations (2) and (3), using Oja’s rule in matrix form, and using the Gram-Schmidt algorithm, respectively [28].
Using Oja’s rule in matrix form:
$$\frac{d\omega(t)}{dt} = \omega(t)\, Q - \mathrm{diag}\left[\omega(t)\, Q\, \omega(t)^{T}\right] \omega(t) \tag{2}$$
Using the Gram-Schmidt algorithm:
$$\Delta\omega(t) = -\mathrm{lower}\left[\omega(t)\, \omega(t)^{T}\right] \omega(t) \tag{3}$$
where ω(t) is a matrix, representing synaptic weights for both Equations (2) and (3).
Also, consider Equation (4) for the autocorrelation matrix [28].
$$Q = \eta\, X X^{T} \tag{4}$$
where $X X^{T}$ is the outer product of the inputs, “diag” (in Equation (2)) is the function that sets all matrix elements outside of the diagonal equal to 0 (zero), and “lower” (in Equation (3)) is the function that sets all matrix elements on or above the diagonal equal to 0 (zero).
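The two matrix operators defined above are simple element masks. The following C sketch shows them for a small square matrix; the names `mat_diag`, `mat_lower`, and the size `DIM` are hypothetical and chosen only for illustration.

```c
#define DIM 3  /* illustrative matrix size (assumption) */

/* diag: keep the diagonal, set every off-diagonal element to 0. */
void mat_diag(float a[DIM][DIM], float out[DIM][DIM])
{
    for (int i = 0; i < DIM; ++i) {
        for (int j = 0; j < DIM; ++j) {
            out[i][j] = (i == j) ? a[i][j] : 0.0f;
        }
    }
}

/* lower: keep elements strictly below the diagonal,
 * set every element on or above the diagonal to 0. */
void mat_lower(float a[DIM][DIM], float out[DIM][DIM])
{
    for (int i = 0; i < DIM; ++i) {
        for (int j = 0; j < DIM; ++j) {
            out[i][j] = (j < i) ? a[i][j] : 0.0f;
        }
    }
}
```

Applied to the matrix with rows (1, 2, 3), (4, 5, 6), (7, 8, 9), `mat_diag` keeps only 1, 5, and 9, while `mat_lower` keeps only 4, 7, and 8.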
By amalgamating Equations (2)–(4), the original Equation (1) of GHA can be written in the matrix form with Equation (5) as follows [28]:
$$\Delta\omega(t) = \eta(t) \left( y(t)\, x(t)^{T} - \mathrm{LT}\left[ y(t)\, y(t)^{T} \right] \omega(t) \right) \tag{5}$$
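As a consistency check, the $(i, j)$ entry of Equation (5) can be expanded to recover Equation (1). The sketch below assumes that $y = \omega x$ and that $\mathrm{LT}[\cdot]$ keeps the diagonal and the elements below it while setting the elements above the diagonal to zero; this is the usual lower-triangular operator in Sanger's rule, but it is not defined explicitly in this section, so it should be read as an assumption.

$$
\begin{aligned}
\left[ \mathrm{LT}\left[ y\, y^{T} \right] \omega \right]_{ij} &= \sum_{k=1}^{i} y_i\, y_k\, \omega_{kj}, \\
\Delta\omega_{ij} &= \eta \left( y_i\, x_j - y_i \sum_{k=1}^{i} \omega_{kj}\, y_k \right) = \eta\, y_i \left( x_j - \sum_{k=1}^{i} \omega_{kj}\, y_k \right).
\end{aligned}
$$

The sum stops at $k = i$ because row $i$ of $\mathrm{LT}[y\, y^{T}]$ is zero for every column $k > i$, which is what makes each output neuron depend only on itself and the neurons before it.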
The GHA is utilized in different applications/fields, including Artificial Intelligence (AI), speech recognition, and image processing, where the features or principal components (PCs) are extracted and utilized for subsequent analysis [53]. Furthermore, due to the single-layer processing nature of the GHA, the change in the synaptic weight only depends on the response of the inputs and the outputs of that layer, which in turn eliminates the multi-layered dependency associated with backpropagation algorithms [53]. In Equations (1) and (5), the learning rate parameter η sets the tradeoff between the accuracy of convergence and the learning speed of the GHA [53].

2.3. Analysis on Existing Works on Hardware Architectures for GHA

We investigated the existing research works on FPGA-based high-level synthesis (HLS) architectures for GHAs in the published literature. From our investigation and to the best of our knowledge there is no existing FPGA-HLS-based architecture for the GHA in the published literature. Therefore, we investigated the existing works on various other hardware architectures for GHAs in the published literature. From these existing works, for our analysis, we selected the works on GHA hardware designs that are most relevant as well as that demonstrate the evolution of hardware designs for GHA. These existing works are discussed and presented in detail below and summarized in Table 1.
An FPGA-based hardware design was proposed for the Exponential Hebbian learning algorithm in [54]. Authors introduced nonlinearity by using a low-pass filter with a non-linear feedback loop, and also by utilizing exponential-shaped adaptation of weights that comprise different time constants for rising and falling. Authors implemented eight synapses in a single FPGA using a complete serial dataflow, which reduced the hardware resource consumption. The design consists of four conventional synapses with fixed weights, and four Hebbian synapses for exponential online learning. Experimental results and analysis, executed on Xilinx XC3090 FPGA, showed improved performance of the proposed design compared to a linear solution.
A hardware design was proposed in [55,56] for principal component analysis (PCA) based on the Generalized Hebbian learning algorithm (GHA), which was deployed on a system-on-programmable-chip (SOPC) platform. Authors proposed two different hardware architectures for GHA-based PCA. Architecture I comprised l number of modules, where each module j at time n computed the synaptic weight vector wj(n + 1), and each module j comprised m sub-modules. Architecture II comprised l number of stages, where each stage j encompassed modified (from Architecture I) submodules ji, where i ranges from 1 to m. In Architecture II, the updating of different synaptic weight vectors was divided into a number of stages; and the results of the precedent stages were utilized for the computation of subsequent stages, which in turn accelerated the training time and reduced the area. Experiments were performed on Altera Cyclone III FPGA using texture classification dataset. The proposed Architecture II achieved 89.82% classification success rate, and 986.73 ms execution time, hence, speedup of 32.28× compared to its CPU software counterpart. Authors concluded that proposed Architecture II was a better option for on-chip learning applications, which typically requires small area and high-performance computations.
In [57], an FPGA-based hardware design was proposed for a Hebbian Eigenfilter Spike Sorting algorithm, where PCA was computed via Hebbian learning. Authors utilized Hebbian Eigenfilter, since it eliminated the need for computationally expensive covariance analysis and eigenvalue decomposition in conventional PCA, and had the capability to filter a specified number of principal components, which in turn resulted in area-efficient hardware designs. The proposed system-level architecture comprised two major subsystems: a training module and a real-time processing module. The real-time processing module was created in such a way to be scalable and parameterizable; hence, identical architectures were employed to process real-time spike sorting in parallel. In addition, a folding technique was employed to share computing resources among the real-time parallel processing elements. Experiments were performed on Xilinx FPGA (Spartan 6 Low-Power XC6S-LX150L) using both synthetic and real clinical spike train data, in terms of throughput, accuracy and power consumption. Experimental results showed that spike classification accuracy was almost the same between the Hebbian Eigenfilter and conventional PCA-based approach. The Hebbian Eigenfilter hardware design achieved 17× and 33× speedup compared to its CPU software counterpart, in terms of learning latency and real-time projection latency, respectively. Authors concluded that proposed hardware design based on the Hebbian Eigenfilter led to less power and less hardware resource consumption, compared to other conventional spike sorting algorithms.
In [34], the same research group (or some authors from the same group) published another paper on the same (or similar) PCA based on GHA as in [55,56]. In [34], authors elaborated on internal hardware architectures for the three main modules for GHA-based PCA, i.e., synaptic weight vector updating (SWU) unit, principal component computation (PCC) unit, and memory unit. Experiments were performed on two FPGA development platforms, Altera Cyclone IV EP4CGX150DF31C7 FPGA and Altera Cyclone III EP3C120F780C8 FPGA, using several texture classification datasets, with varying data sizes. From the experimental results and analysis, it was observed that utilized hardware resources increased with the number of stages in Architecture II, as well as with the size of the dataset. The proposed Architecture II achieved above 90% classification success rate; 733.14 ms execution time, hence, speedup of 13.81× compared to its CPU software counterpart; and 0.129 W power consumption. Authors also compared the proposed hardware design with the related works in the published literature. Authors concluded that the proposed architecture was effective for on-chip learning applications requiring area-efficiency, high classification success rate, and high-performance computations.
An FPGA-based hardware architecture was proposed for spike sorting based on the GHA in [58]. Authors implemented GHA training in hardware using pipelining to enhance throughput. In this case, the GHA training hardware architecture, a p-stage pipeline, produced p principal components for feature extraction, and each stage comprised a principal component computing (PCC) unit, a synaptic weight vector updating (SWU) unit, and a memory unit. The spikes were delivered to the pipeline one at a time. This work was also from the same research group (or some authors from the same group) as in [55,56], thus the proposed architecture had some similarities, for instance, the three main modules of each stage: PCC, SWU, and memory unit. Experiments were performed on Altera Cyclone IV EP4CGX150 FPGA. The proposed hardware architecture achieved above 92% classification success rate for four target neurons, with signal-to-noise ratio (SNR) equal to 10, and 2.91 ms execution time. Authors also compared the proposed hardware design with the related works in the published literature. Authors concluded that the proposed architecture was an efficient spike sorting hardware design for achieving small area and high-performance computations.
A network-on-chip (NOC) based hardware architecture was proposed for spike sorting based on GHA in [59]. This work was the continuation of the work done in [58], which proposed spike sorting based on GHA using a pipelining technique. This work was also from the same research group (or some authors from the same group) as in [56,57], thus the proposed architecture had some similarities; for instance, the three main modules: PCC, SWU, and memory unit. An NOC-based hardware architecture was selected, since it was deemed suitable for enhancing transmission speed and throughput of spike sorting based on the GHA. The NOC-based hardware architecture for spike sorting based on the GHA was deployed on Altera Cyclone IV EP4CGX150DF31 to perform the required experiments. The proposed hardware architecture achieved above 84% classification correct rate (CCR) with SNR equal to 1, and above 96% CCR with SNR equal to 10, for three target neurons. The proposed hardware architecture also achieved up to 1 GHz maximum operating frequency, which led to 1.99 ms training (or execution) time, thus 97.08× speedup compared to its CPU software counterpart. However, the proposed hardware architecture occupied more hardware resources on the chip, compared to the previous designs. Authors concluded that the proposed architecture was an effective real-time training device for spike sorting based on the GHA.
In [60], a VLSI-based hardware architecture was proposed for multi-channel online spike sorting, where spike detection is based on the nonlinear energy operator (NEO), and feature extraction is done by the GHA. The proposed architecture comprised three major operations: spike detection, spike alignment, and feature extraction. In this case, a NEO circuit was created to perform spike detection, one channel at a time. A spike buffer was created to hold the detected spikes generated by the NEO circuit, perform the alignment, and send the spikes to the GHA circuit as needed. A GHA circuit was created to perform feature extraction, where a single-channel GHA circuit comprised three main parts: buffer, sum-of-product (SOP) circuit, and synaptic weight vector updating (SWU) unit. The GHA circuit was created using 17-bit fixed-point numbers. The multi-channel GHA circuit was composed of single-channel ones, by sharing the SOP and SWU circuits with all the channels, thus reducing the area for each channel. A clock gating system was employed for all three operations to further reduce the power consumption, by supplying the clock only to the active channels. The proposed VLSI-based hardware architecture was implemented on an application-specific integrated circuit (ASIC) with 90 nm technology. From the experimental results and analysis, it was observed that occupied area increased with the increasing number of channels and increasing segment lengths. The proposed hardware architecture achieved above 97% classification success rate for two target neurons, with a signal-to-noise ratio (SNR) equal to 6–10. In addition, for 64 channels and at a 2 MHz clock rate, the normalized power consumption was reduced by 42.80%, i.e., from 150 μW per channel without clock gating to 85.8 μW per channel with clock gating. Authors concluded that the proposed VLSI-based hardware architecture was an effective solution for applications where spike sorting circuits with low power, small area, and high accuracy were desired.
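The reported power reduction for [60] follows directly from the two per-channel figures quoted above. The short check below only restates that arithmetic and adds no new measurement:

$$
\frac{150\ \mu\mathrm{W} - 85.8\ \mu\mathrm{W}}{150\ \mu\mathrm{W}} = \frac{64.2}{150} = 0.428 = 42.8\%
$$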

Summary

In summary: Our aforementioned investigation revealed that the above works on hardware architectures for GHAs were not created with neuromorphic computing in mind. In addition, most of these existing works were not done with a specific application in mind, although some works stated that their designs were suitable for on-chip learning applications. Also, these studies could potentially be utilized for certain applications based on the utilized datasets. Furthermore, the reported training times of some of these works are considered high and need to be optimized for real-time neuromorphic computing applications, although these works stated that their designs were created for real-time processing. Similarly, the reported occupied on-chip area (i.e., utilized hardware resources) of some of these works is also considered high and could be optimized and reduced for neuromorphic computing applications. The aforementioned works are summarized in Table 1.
Our investigation on these works revealed that there are no GHA hardware architectures using HLS on FPGAs in the published literature. Also, none of these works reported specific interfacing, i.e., either memory-mapped or streaming. In this research work, our intention is to create efficient FPGA-HLS-based GHA hardware architectures for real-time processing of neuromorphic computing platforms/applications considering the associated constraints. With our research work, we strive to address the above challenges/gaps identified in the existing works, by proposing two hardware versions for the GHA, by evaluating the proposed GHA hardware designs in terms of various performance metrics using multiple benchmark datasets, and by providing insight into associated tradeoffs.

3. Design Approach and Development Platform

3.1. Experimental Platform and Approach

In this research work, we introduce our novel and unique FPGA-HLS-based hardware accelerator/architecture for the Generalized Hebbian learning algorithm (GHA) for neuromorphic computing. We also introduce embedded software architecture for the GHA to evaluate our proposed FPGA-HLS-based GHA hardware architecture. In our previous work [61], we created desktop-based software architecture for the GHA, written in C and executed on a desktop computer using Microsoft Visual Studio 2022. We utilize our desktop C program to verify the results and correct functionality of both our embedded FPGA-HLS-based hardware and embedded software architectures for the GHA.
All our embedded FPGA-HLS-based hardware and embedded software experiments are performed on the AMD/Xilinx ZCU104 FPGA development board [62], comprising the Zynq UltraScale+ XCZU7EV-2FFVC1156 MPSoC. This development board combines a powerful processing system (PS) and programmable logic (PL) on the same FPGA device, where the PS consists of the ARM Cortex-A53 64-bit quad-core processor and the ARM Cortex-R5 dual-core real-time processor [62,63]. The device provides substantial on-chip hardware resources, for instance, 504,000 system logic cells (comprising 460,800 CLB (configurable logic block) flip-flops and 230,400 CLB lookup tables), 1728 DSP slices, and 38 Mb of on-chip memory. The development board also comprises 464 I/O (input/output) pins, high-speed off-chip 4 GB DDR4-SDRAM (Double-Data-Rate Synchronous Dynamic Random Access Memory), and an AXI interface [64]. More details of the ZCU104 FPGA development board and the corresponding components can be found in [62,63].
Our embedded FPGA-HLS-based hardware architectures/modules for the GHA are created using the AMD/Xilinx Vivado 2024.1 and AMD/Xilinx Vitis 2024.1 design tools [65,66]. In this case, we use Vitis 2024.1 to create our GHA hardware module/IP and the corresponding system-level design, and use Vivado 2024.1 to incorporate the GHA hardware module/IP into the system-level design and to generate the configuration bitstream to program the FPGA. All our embedded hardware modules are written in synthesizable C code, which is the input to the aforementioned Vivado/Vitis design tools. Utilizing these tools, our high-level synthesizable C code goes through several steps in order to finally create HLS-based hardware architectures/modules for the GHA. Our FPGA-HLS-based hardware modules for the GHA are executed on the Zynq UltraScale+ MPSoC FPGA, running at 100 MHz, to verify their correctness and performance. Our embedded hardware modules/IPs are configured to have single-precision floating-point units (FPUs). We employ the integrated simulator, within the AMD/Xilinx Vivado/Vitis design suite environment, to verify the functionality of the designs before downloading and programming the FPGA.
All our embedded software architectures/modules for the GHA are written in embedded C codes and executed on the ARM Cortex-A53 hard core embedded processor [67] running at 1.2 GHz on the same FPGA. In this case, we utilize the standard optimization technique/option provided with AMD/Xilinx. We use AMD/Xilinx Vitis IDE 2024.1 design tools to design and develop embedded software architectures/modules (i.e., embedded C codes) executed on the ARM processor, as well as to verify the software modules [65,66]. Similar to our embedded hardware modules/IPs, our embedded software modules/architectures are also configured to have single-precision FPUs.

3.2. Our System-Level Architecture

We create system-level architectures for our proposed FPGA-HLS-based GHA hardware IP, as depicted in Figure 2 and Figure 3. During the design space exploration, we create two hardware versions for the GHA: one with a memory-mapped interface, and another with a streaming interface. Therefore, we create two system-level architectures, one for each GHA hardware version. For our system-level architectures, we incorporate the processing system (PS) to perform high-level control functions and the initial data transfer from the desktop computer to the FPGA development board. For the memory-mapped interface version (in Figure 2), the PS comprises the ARM Cortex-A53 and the memory controller. For the streaming interface version (in Figure 3), the PS comprises the ARM Cortex-A53, the Direct Memory Access (DMA) controller, and the memory controller. In both versions, the ARM processor performs high-level control functions, initiates the execution of the GHA IP, preprocesses the input data, and processes the output data to calculate the classification accuracy. Conversely, the PL executes the whole GHA, including the compute-intensive tasks, such as matrix multiplications, weight updates, etc. In both versions, a memory controller manages access to the off-chip DDR4-SDRAM.
For both system-level hardware architecture versions, the AXI (Advanced eXtensible Interface) bus acts as the glue logic for the system [64]. In our designs, the PS–PL interface employs the AXI4 bus to facilitate high-speed data transfers and communication. This AXI-bus-based PS–PL interface [68] also supports bidirectional data flow, which enables the ARM processor in the PS to send/receive data to/from our FPGA-HLS-based GHA hardware IP in the PL as well as the AXI Timer in the PL. For our design, we configure our PS–PL interface to operate at a 32-bit data width, to accommodate our single-precision 32-bit FPUs and 32-bit floating-point data, which in turn reduces the occupied hardware resources for our designs compared to those of wider data widths, such as 64 bits and 128 bits.
For both our system-level hardware versions as well as for our embedded software architecture, we incorporate the AXI Timer (running at 100 MHz) to measure the execution times of both our embedded hardware modules/IPs and software modules/architectures [69]. Similar to our PS–PL interface, our AXI Timer is also configured to operate at a 32-bit data width, and is controlled by the ARM processor via the AXI4 bus.
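To illustrate why a 32-bit data width is sufficient for single-precision data, the following minimal sketch shows how a float value can be carried as one 32-bit word and recovered unchanged on the other side. The helper names are hypothetical and are not part of our design or of any AMD/Xilinx driver; the sketch also assumes that `float` is the 32-bit IEEE 754 format, which holds on the ARM Cortex-A53.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a single-precision value as the 32-bit
 * word that would travel over a 32-bit data bus. memcpy avoids the
 * undefined behavior of pointer-cast type punning. */
uint32_t float_to_word(float value)
{
    uint32_t word;
    memcpy(&word, &value, sizeof word);
    return word;
}

/* Hypothetical helper: recover the single-precision value from the word. */
float word_to_float(uint32_t word)
{
    float value;
    memcpy(&value, &word, sizeof value);
    return value;
}
```

Because the conversion is a bit-for-bit copy, a value such as the learning rate survives the round trip without any loss of precision.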
For both our system-level hardware architecture versions, we use the ARM Cortex-A53 processor to monitor and control our FPGA-HLS-based GHA hardware IP. In this case, the ARM processor sends a flag (or trigger signal) to our GHA hardware IP to initiate the execution of GHA; starts the AXI-Timer to measure the execution time; receives a flag (or trigger signal) from our GHA hardware IP once the GHA is completed; and manages the dataflow between the PS and PL. The AXI-Timer starts measuring the execution time, as soon as the ARM processor sends a flag to our GHA hardware IP, and stops measuring the execution time as soon as the ARM processor receives a flag from our GHA hardware IP once the GHA is completed.
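The timing procedure described above can be sketched as follows. This is a minimal illustration, assuming a free-running 32-bit up-counter clocked at 100 MHz that is read once when the start flag is sent and once when the completion flag is received; the function names are hypothetical and do not correspond to the AXI Timer driver API.

```c
#include <stdint.h>

/* Timer clock stated in the text: 100 MHz, i.e., 10 ns per tick. */
#define TIMER_CLK_HZ 100000000u

/* Elapsed ticks between two readings of a free-running 32-bit up-counter.
 * Unsigned subtraction is performed modulo 2^32, so one wrap-around
 * between the two readings is handled correctly. */
uint32_t elapsed_ticks(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Convert ticks to microseconds (100 ticks per microsecond at 100 MHz). */
double ticks_to_us(uint32_t ticks)
{
    return (double)ticks / (double)(TIMER_CLK_HZ / 1000000u);
}
```

Under these assumptions, a 32-bit counter at 100 MHz wraps after roughly 42.9 s (2^32 ticks of 10 ns each), so a single start/stop pair is only unambiguous for runs shorter than that.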
For the streaming interface version, the DMA controller facilitates high-speed data transfers between the off-chip DDR4-SDRAM and our custom GHA IP in the programmable logic (PL). These data transfers are also done via the AXI4 bus [64], to ensure efficient data flow. Owing to its higher throughput, the streaming interface version is suitable for real-time unsupervised learning applications, where high-speed and low-latency responses are critical.
By creating system-level architectures with both versions, we explore various performance metrics and associated tradeoffs, in terms of execution time, speedup, occupied hardware resources on chip, and power consumption. Our two system-level hardware architecture versions are detailed below.

3.2.1. Memory-Mapped Interface-Based Version

Initially, we create our system-level hardware architecture for our proposed FPGA-HLS-based GHA with a memory-mapped interface, as shown in Figure 2. This memory-mapped interface is implemented with AXI4-Lite [64] and enables accessing custom hardware IPs as memory-mapped peripherals. With this interface, the data transfer is bidirectional, and data are transferred using standard read/write operations over designated addresses [64]. Although AXI4-Lite is suitable for low-bandwidth applications or control signals, it is less suitable for high-throughput data processing, due to its limited bandwidth (it does not support burst transfers) and high addressing overhead [64]. We initially use the memory-mapped interface due to its ease of implementation, the smaller logic footprint of AXI4-Lite, and AXI4-Lite's capability of handling single memory-mapped transactions.
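As a rough illustration of how such an interface can be requested in Vitis HLS, the sketch below marks the arguments of a small top-level function, and its block-level control, as AXI4-Lite registers using the `s_axilite` interface mode. The function name, the array sizes (13 inputs as in the Wine dataset, two output neurons), and the choice to map the arrays themselves to AXI4-Lite are assumptions made for this example only; they are not taken from our actual GHA IP.

```c
#define N_IN  13  /* illustrative: number of input attributes (Wine) */
#define M_OUT 2   /* hypothetical number of output neurons */

/* Hypothetical top-level function computing y = C x. The arrays and the
 * block-level control signals (port=return) are exposed as AXI4-Lite
 * registers. A standard C compiler ignores the HLS pragmas, so the same
 * code also runs as plain software. */
void gha_forward_mm(const float c[M_OUT * N_IN], const float x[N_IN],
                    float y[M_OUT])
{
#pragma HLS INTERFACE s_axilite port=c
#pragma HLS INTERFACE s_axilite port=x
#pragma HLS INTERFACE s_axilite port=y
#pragma HLS INTERFACE s_axilite port=return
    for (int i = 0; i < M_OUT; i++) {
        float acc = 0.0f;
        for (int j = 0; j < N_IN; j++) {
            acc += c[i * N_IN + j] * x[j];  /* row-major weight layout */
        }
        y[i] = acc;
    }
}
```

With this style of interface, the processor writes every input word and reads every output word through an individually addressed register access, which is the addressing overhead discussed above.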

3.2.2. Streaming Interface-Based Version

Next, we create our system-level hardware architecture for our proposed FPGA-HLS-based GHA with a streaming interface, as shown in Figure 3. This streaming interface is implemented with the AXI4-Stream protocol and enables direct and low-latency data transfers between hardware modules/IPs [64]. This interface supports unidirectional and handshake-based data transmission [64]. Unlike the memory-mapped interface, the AXI4-Stream protocol eliminates the need for an address phase (and thus the addressing overhead) and supports data bursts of unlimited length [64]. Although the design and implementation of the streaming interface version is more complex than that of the memory-mapped version [64], the aforementioned traits make the streaming interface suitable for real-time, high-speed, and low-latency applications. Therefore, this version is more suitable for real-time processing of our FPGA-HLS-based GHA, with minimal buffering.

3.3. Our Benchmark Datasets

In this research, we utilize five different benchmark datasets to evaluate our embedded hardware and embedded software architectures for the GHA. All five datasets are obtained from the UCI (University of California Irvine) machine learning repository [70], which is widely adopted in machine learning research. These five datasets are selected from different fields and comprise varying data sizes and varying attributes. Each dataset consists of numerical attribute values.
The Wine benchmark dataset [71] is based on the chemical analysis of wine produced by three different vineyards from the same region in Italy. The Wine dataset consists of 178 instances (or vectors), each comprising 13 attributes. The chemical analysis determines the quantities of 13 ingredients (thus, 13 attributes) present in each of the three types of wines from 3 vineyards, which includes alcohol concentration, malic acid, ash content, alkalinity, magnesium, etc.
The Parkinson’s Disease benchmark dataset [72] is based on the biomedical voice measurement from 31 people, where 23 people have Parkinson’s disease. The Parkinson’s Disease dataset consists of 197 instances (or vectors), each comprising 23 attributes. These attributes represent different voice measurements, which can be used to distinguish healthy individuals from those with Parkinson’s disease. The vectors represent the recorded voice samples from the participants. The “status” attribute comprising 0 represents healthy subject, whereas 1 represents subject with Parkinson’s disease.
The Heart Disease benchmark dataset [73] is based on 14 clinical indicators such as age, gender, chest pain type, resting blood pressure, cholesterol level, maximum heart rate, etc. The Heart Disease dataset consists of 303 instances (or vectors), each comprising 76 attributes, although most reported studies have focused on 14 features. This dataset is commonly used for predicting whether an individual is likely to have heart disease. The "goal" attribute indicates the existence of heart disease in a patient, with integer values ranging from 0 to 4, where 0 represents the absence of heart disease, and values 1–4 represent the presence and severity of heart disease.
The Liver Disease benchmark dataset [74] is based on Native American Liver Patients, and contains five blood test variables, which are considered sensitive to liver issues that might have resulted from excessive alcohol intake. The Liver Disease dataset consists of 345 instances (or vectors), each comprising 5 attributes. In this case, each vector represents a single male subject, and 5 attributes include total/direct bilirubin, alkaline phosphatase, alanine aminotransferase levels, etc., aimed at diagnosing liver-related disorders.
The Breast Cancer benchmark dataset [75] is based on the digitized images of fine needle aspiration (FNA) samples of breast tissue. The Breast Cancer dataset consists of 286 instances (or vectors), each comprising 9 attributes. The attributes, derived from the digitized images, describe nuclear properties such as radius, texture, perimeter, area, smoothness, etc.
We decided to utilize five different datasets from different fields, with varying data sizes and varying characteristics, in order to evaluate the scalability and versatility of our proposed embedded hardware and software architectures for the GHA. Furthermore, in our previous work [61], we utilized these five benchmark datasets to analyze the GHA on a neuromorphic hardware platform, which demonstrated the suitability of these datasets for learning and classification tasks. Also, similar datasets from the UCI ML repository have been utilized for other neuromorphic computing research. For instance, in [76,77] authors utilized four (Iris, Wine, Breast Cancer and Optdigits) and five (Iris, Breast Cancer, Diabetes, Wine, Liver Disease) benchmark datasets, respectively, to analyze performance metrics of the Spiking Neural Networks (SNN) algorithm for neuromorphic computing. The summary of these five benchmark datasets is presented in Table 2.

4. Our Proposed FPGA-HLS-Based GHA Hardware Architectures

In this section, we discuss and present our novel and unique FPGA-HLS-based hardware architecture for our GHA for neuromorphic computing. We initially examine and analyze the functional flow of the GHA (presented in Section 4.4), in order to create HLS-based hardware architecture for GHA. Our design decisions, approaches, and techniques utilized for creating FPGA-HLS-based hardware for the GHA, are detailed below.
In this work, in order to expedite our hardware design and development process, we use Xilinx/AMD Vitis High-Level Synthesis (HLS), since HLS uses high-level programming languages, such as C/C++, to create FPGA-based hardware designs, instead of using low-level hardware description languages (HDL), such as Verilog or VHDL. This not only accelerates the hardware design and development process, but also enables the software designers to create FPGA-based hardware designs without extensive hardware design expertise.
We create two versions for our FPGA-HLS-based hardware architectures for the GHA: one with a memory-mapped interface, and another with a streaming interface. For both versions, the internal hardware architectures of our proposed FPGA-HLS-based GHA hardware IPs are quite similar in terms of functionality and data flow; however, there are some slight variations between the versions due to the differences in data transfer to and from the DDR4-SDRAM. For both versions, we utilize the embedded C code created for our embedded software GHA design, and convert it to synthesizable C code in order to create our FPGA-HLS-based GHA hardware IP.
In this research work, for our FPGA-HLS-based GHA hardware architectures, we employ several HLS directives (i.e., pragmas), including loop unrolling, pipelining, and interface specification, to optimize our designs. Typically, Vitis HLS supports various optimization techniques and directives to further enhance the efficiency of HLS-based hardware designs [78]. These optimization techniques enable us to leverage the inherent parallelism and pipelining traits of the FPGAs, significantly enhancing the performance metrics of our proposed FPGA-HLS-based GHA hardware architectures. In addition, the RTL (register-transfer level) designs generated by HLS tools (HLS-generated RTL designs) can easily be deployed and used on different FPGA devices with minimal modifications, hence, making HLS-based designs more generic.
During the design phase, the HLS-generated RTL designs for the GHA are configured and packaged as AMD/Xilinx Vivado IPs, and then incorporated into the system-level architectures as well as the Zynq SoC using the AXI interface. The AXI interface provides the required communication between the processing system (PS) and programmable logic (PL). As detailed in Section 3, our FPGA-HLS-based GHA hardware architectures are executed on a Zynq ZCU104 FPGA development board [62], comprising the Zynq UltraScale+ XCZU7EV-2FFVC1156 MPSoC. For our designs, our HLS-based GHA hardware architectures are implemented on PL, whereas PS composed of the ARM processor is used to monitor and control the processing of GHA hardware. For instance, PS sends a signal to PL to start processing the GHA, and receives a signal from PL once the processing is completed; and PS also manages input data.

4.1. Vitis High-Level Synthesis (HLS) Process

We use Vitis HLS 2024.1 to create both our memory-mapped interface-based and streaming interface-based FPGA-HLS-based GHA hardware IPs, based on the Vitis HLS design flow in the Vitis HLS user guide [79]. Our GHA hardware IPs/modules for both versions are written in synthesizable C/C++ code. For both versions of our designs, we use Vitis HLS to generate the RTL designs/code from this synthesizable C/C++ code. Next, we export these RTL designs as Vivado IP cores (or as Vitis kernels). These Vivado IP cores are deployed in the programmable logic (PL) region of the Zynq UltraScale+ MPSoC FPGA. Then, we verify the functionality of the GHA hardware modules using the five benchmark datasets detailed in Section 3.3. In this case, Vitis HLS generates the RTL designs in such a way as to meet the timing constraints of the target FPGA device, which in turn enables reusing the proposed designs on different FPGA devices.
The Vitis HLS design suite has various compiler directives (or pragmas), such as loop pipelining, loop unrolling, and function inlining, that can be incorporated to enhance various performance metrics (including latency, throughput, occupied hardware resources, etc.) of the proposed designs [66,79]. For instance, loop pipelining enables concurrent execution of successive iterations of a loop, improving the throughput of computational kernels like those found in the GHA. Similarly, loop unrolling replicates loop bodies, allowing simultaneous computation of multiple elements in vector-matrix operations. Therefore, we optimize our synthesizable C/C++ codes for both versions of our GHA hardware IPs, using various HLS directives/pragmas (including loop pipelining and unrolling), available with the Vitis HLS design tools, in order to achieve the best latency, throughput, and hardware resource utilization.
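The effect of these two directives can be sketched on the outer product y xᵀ, which appears in the GHA weight update. The example below is a minimal illustration only: the function name, the array sizes, and the placement of the pragmas are assumptions made here, and the directives actually applied in our GHA IPs may differ.

```c
#define N_IN  13  /* illustrative: number of input attributes (Wine) */
#define M_OUT 2   /* hypothetical number of output neurons */

/* Outer product yx = y * x^T. Every element is independent of the others,
 * so the inner loop can be replicated (unrolled) and successive outer
 * iterations can overlap (pipelined). A standard C compiler ignores the
 * HLS pragmas and executes the loops sequentially. */
void outer_product(const float y[M_OUT], const float x[N_IN],
                   float yx[M_OUT][N_IN])
{
    for (int i = 0; i < M_OUT; i++) {
#pragma HLS PIPELINE II=1
        for (int j = 0; j < N_IN; j++) {
#pragma HLS UNROLL
            yx[i][j] = y[i] * x[j];
        }
    }
}
```

In practice, a fully unrolled inner loop reads and writes many array elements in the same cycle, so the achievable initiation interval also depends on how the arrays are mapped to on-chip memory ports; the requested II=1 is a target, not a guarantee.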
Specifically for the streaming interface-based FPGA-HLS-based GHA hardware IPs, we use the HLS streaming library to stream input and output data, in order to further enhance the speedup. In this case, we employ the AXI-Stream protocol and the AXI-Stream FIFO to provide the streaming interface in order to achieve the maximum throughput.
Both versions of our embedded hardware IPs/modules are executed on Zynq UltraScale+ ZCU104 FPGA board with XCZU7EV-2FFVC1156 FPGA. Since our FPGA is running at 100 MHz, our designs’ target clock frequency is also 100 MHz (i.e., clock period of 10 ns). As stated in [79], the HLS design suite has two “flow target” options: (1) Vivado IP flow target, and (2) Vitis kernel flow target. The flow target determines the interface ports applied to the synthesized design. From these two, we employ the “Vivado IP flow target” for creating our designs. The HLS design tools generate the RTL designs/codes in both Verilog and VHDL. Using the HLS packaging feature, we export the RTL designs/codes for both hardware versions as the Vivado IPs, enabling these IPs to seamlessly integrate with the Zynq processing system (PS).

4.2. Vitis HLS GHA Hardware IP with Memory-Mapped Interface

Initially, we create our embedded FPGA-HLS-based GHA hardware architecture using Vitis HLS with a memory-mapped interface. High-level architecture of our FPGA-HLS-based GHA hardware using memory-mapped interface is depicted in Figure 4. In this design, our customized and optimized GHA hardware IP core exposes its control, input, and output signals as addressable registers, allowing the ARM Cortex-A53 processor to interact directly with our GHA hardware IP core through registers for reads and writes. This direct communication method simplifies software control and debugging, by using simple memory operations, instead of complex data management protocols.
The internal architecture of our FPGA-HLS-based GHA hardware IP is further improved with HLS optimization directives, such as loop pipelining and unrolling, which in turn enhance the computational throughput and reduce the latency per iteration. The memory-mapped interface-based design is suitable for applications with moderate data throughput and for applications where precise control over execution steps is necessary. However, the dependency on register-level handshaking introduces addressing overhead when processing large volumes of data, which can limit the overall performance of the GHA hardware design.
As depicted in Figure 4, the memory-mapped interface-based GHA design is integrated seamlessly with the Zynq Processing System (PS) via AXI interconnects, supported by the clock synchronization and reset modules. Performance profiling for our GHA hardware IP is obtained from the AXI Timer, which provides accurate measurements of execution cycles. Although this design is efficient and easy to develop and integrate, it provides only moderate throughput, which makes it suitable for less data-intensive embedded computing tasks/algorithms.

4.3. Vitis HLS GHA Hardware IP with Streaming Interface

Our experimental results and analysis on the memory-mapped interface-based design reveal that this design has throughput and speed-performance limitations. In order to overcome these limitations, we create our next version of embedded FPGA-HLS-based GHA hardware architecture using Vitis HLS with a streaming interface. High-level architecture of our FPGA-HLS-based GHA hardware using streaming interface is depicted in Figure 5.
Our streaming interface-based GHA IP is created in such a way as to interface with the DDR4-SDRAM via the AXI Direct Memory Access (DMA), where the DMA provides efficient data transfer between our GHA IP and the SDRAM using the AXI-Stream protocol. This streaming interface has several advantages, including: (1) allowing continuous dataflow through our proposed GHA hardware IP without unnecessary memory accesses, thus significantly enhancing the throughput and reducing the latency of our GHA IP; and (2) enabling on-the-fly processing of data, while avoiding the memory access bottleneck. These advantages enable our proposed FPGA-HLS-based GHA hardware IP to handle big datasets efficiently and effectively, without the need to store all the input data. These advantages also make our streaming interface-based GHA hardware IP suitable for real-time processing of big data.
Analogous to the memory-mapped interface-based design, the internal architecture of our streaming interface-based GHA hardware IP is also further improved with HLS optimization directives, such as loop pipelining and unrolling, which in turn enhances the computational throughput and reduces the latency per iteration. In addition, the use of the AXI-Stream protocol in conjunction with the DMA enables our streaming interface-based GHA IP to handle large datasets with better scalability and performance.
As depicted in Figure 5, the streaming interface-based GHA design is integrated with the DMA, Streaming FIFO, Zynq PS, etc., via AXI interconnects. Our experimental results and analysis (in Section 5) demonstrate that our streaming interface-based GHA IP outperforms the memory-mapped interface-based GHA IP in terms of speedup and throughput, making the streaming interface-based design a more desirable option for neuromorphic computing applications that require real-time and high-volume data processing and analysis. However, there is a tradeoff between design complexity and performance improvement, with our proposed FPGA-HLS-based GHA hardware IP with a streaming interface.

4.4. GHA Functional Flow: Embedded Hardware and Embedded Software

The internal architectures of our embedded FPGA-HLS-based GHA hardware architectures for both the versions as well as our embedded software GHA architecture are inspired by the Generalized Hebbian learning algorithm (GHA) in [28]. The GHA comprises several steps. These steps as well as the functional flow of our embedded hardware and the software GHA designs are presented in Algorithm 1 [28].
Based on the GHA learning rule, in a trained network, each output neuron corresponds to an eigenvector, and the outputs are typically arranged in descending order of their associated eigenvalues [28]. As illustrated in Algorithm 1, the training process begins by randomly initializing the weight matrix C, which is iteratively refined to approximate the eigenvectors of the input correlation matrix [28]. The algorithm runs through multiple training epochs; and during each epoch, it processes every input vector x in the dataset. For each input, the output vector y is obtained by computing the matrix-vector product of C and x [28].
Algorithm 1. Generalized Hebbian Learning Algorithm [28]
DATA:    UCI ML repository dataset loaded from the TensorFlow library
STEP 1:  Initialize the weight matrix C with small random values:
         C(0) ∈ R^(M×N)
         Here, M is the number of output neurons, and N is the number of input neurons, with M < N
STEP 2:  For each input training vector x(t) and its respective target output t, do steps 3–6
STEP 3:  Compute the output vector y(t):
         y(t) = C(t) x(t)
STEP 4:  Compute the outer products:
         y(t) x^T(t) and y(t) y^T(t)
STEP 5:  Set all the elements above the diagonal of y(t) y^T(t) to zero, making it lower triangular:
         LT[y(t) y^T(t)]
STEP 6:  Update the weight matrix C(t) using the GHL rule:
         C(t + 1) = C(t) + η (y(t) x^T(t) − LT[y(t) y^T(t)] C(t))
         where η is the learning rate
The weight update step, which is the last step (step 6), is vital to the GHA. This step combines the outer product of y and x with the product of the lower-triangular part of the outer product of y with itself and the weight matrix C [28]. These computations modify C in such a way as to align the weights with the most significant eigenvectors. After completing the training cycles, the GHA outputs the final weight matrix C, which captures the leading M eigenvectors of the input data's correlation structure [28]. This iterative learning process allows the network to adaptively discover key data patterns. A convergence theorem ensures that the GHA reliably arrives at the correct eigenvectors without needing to precompute the full correlation matrix Q [28].
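Steps 3 to 6 of Algorithm 1 can be sketched in plain C as follows. This is a minimal software illustration under stated assumptions, not the code of our hardware IP: the function name and the small matrix sizes are chosen only for the example, and the update is written to a temporary matrix so that every term is evaluated with the weights C(t) from before the update.

```c
#define N_IN  3   /* illustrative sizes, not taken from the paper */
#define M_OUT 2

/* One GHA training step for a single input vector x (steps 3-6).
 * c is the M_OUT x N_IN weight matrix, updated in place;
 * eta is the learning rate. */
void gha_update(float c[M_OUT][N_IN], const float x[N_IN], float eta)
{
    float y[M_OUT];
    float c_new[M_OUT][N_IN];

    /* Step 3: y = C x */
    for (int i = 0; i < M_OUT; i++) {
        y[i] = 0.0f;
        for (int j = 0; j < N_IN; j++) {
            y[i] += c[i][j] * x[j];
        }
    }

    /* Steps 4-6: C(t+1) = C(t) + eta * (y x^T - LT[y y^T] C(t)).
     * LT keeps only the entries with k <= i, so row i of the update
     * involves outputs 0..i only. */
    for (int i = 0; i < M_OUT; i++) {
        for (int j = 0; j < N_IN; j++) {
            float decay = 0.0f;
            for (int k = 0; k <= i; k++) {
                decay += y[i] * y[k] * c[k][j];
            }
            c_new[i][j] = c[i][j] + eta * (y[i] * x[j] - decay);
        }
    }

    /* Commit only after all rows are computed, so that every term above
     * was evaluated with the old weights C(t). */
    for (int i = 0; i < M_OUT; i++) {
        for (int j = 0; j < N_IN; j++) {
            c[i][j] = c_new[i][j];
        }
    }
}
```

For example, with C = [[0.5, 0, 0], [0, 1, 0]], x = (1, 0, 0), and η = 0.5, the first weight grows from 0.5 to 0.6875 while the second row is unchanged; if the first row already has unit length along x, the Hebbian and decay terms cancel and C is a fixed point of the update.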
Our proposed embedded GHA hardware architectures are created in such a way as to incorporate certain biological traits of the GHA in [28], including local synaptic learning rules and activity-dependent weight updates, by utilizing the GHL rule (i.e., step 6 in Algorithm 1). As demonstrated in Algorithm 1, step 6 consists of two biologically motivated terms: the first term strengthens synapses based on correlated pre- and post-synaptic activity; and the second term stabilizes the learning process by limiting the growth of the weights, so that each weight update depends only on the input training vector x(t), the output vector y(t), and the current weights. This GHL rule preserves the adaptation behavior of biological neurons, where synaptic strengths evolve gradually over time, based on the streaming inputs [28].
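Written element-wise, the rule in step 6 makes the two terms and their locality explicit. The index notation below (c_ij for the weight from input j to output neuron i, with 1-based indices) is introduced here for illustration and follows directly from expanding LT[y(t)yT(t)]C(t):

```latex
\Delta c_{ij}(t) \;=\; \eta \left[ \underbrace{y_i(t)\, x_j(t)}_{\text{Hebbian term}}
\;-\; \underbrace{y_i(t) \sum_{k=1}^{i} c_{kj}(t)\, y_k(t)}_{\text{decay term}} \right],
\qquad i = 1, \dots, M, \quad j = 1, \dots, N .
```

For the first output neuron (i = 1), the sum contains a single term and the rule reduces to Oja's single-neuron rule; each subsequent neuron additionally subtracts the contributions of the neurons before it.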

5. Experimental Results and Analysis

We perform experiments to evaluate the feasibility and efficiency of our proposed FPGA-HLS-based embedded hardware architectures/accelerators for the GHA (for both the memory-mapped interface and the streaming interface versions) for neuromorphic computing. We utilize our proposed embedded software architecture for the GHA to evaluate both versions of our proposed embedded FPGA-HLS-based GHA hardware architectures. Furthermore, we employ our previously introduced software architecture for the GHA [61], executed on desktop computer, to verify the results and correct functionality of both versions of our embedded FPGA-HLS-based GHA hardware IPs as well as embedded software architecture for the GHA.
All our embedded hardware and embedded software experiments are performed on the Xilinx/AMD ZCU104 FPGA development board, composed of the Zynq UltraScale+ XCZU7EV-2FFVC1156 MPSoC. Our proposed embedded FPGA-HLS-based GHA hardware architectures are executed on the Zynq UltraScale+ MPSoC FPGA, running at 100 MHz, whereas our proposed embedded software GHA architecture is executed on the ARM Cortex-A53 hard core embedded processor, running at 1.2 GHz on the same FPGA.
From our experiments, we obtain various performance metrics/results, including execution time (thus, speedup), classification accuracy, convergence rate, error rate, scalability, occupied hardware resources, and power consumption. The scalability metric is to demonstrate our embedded designs’ ability to handle different datasets with varying data sizes, varying number of attributes, and other differing parameters that are commonly found in many datasets for neuromorphic-computing-related applications.
Our experimental results and analysis for both versions of embedded hardware GHA IPs as well as embedded software GHA are presented in the following sub-sections. These experimental results are post-implementation results generated by Vivado 2024.1. It should be noted that our embedded hardware and embedded software results are obtained in real-time, while these designs are actually running on the Zynq FPGA, whereas the desktop software results are obtained through simulation on a desktop computer.
Considering our investigation on existing works (presented in Section 2.3), it was revealed that there are no GHA hardware architectures using HLS on FPGAs in the published literature. Hence, in this paper, we do not report any direct performance comparisons with the existing works on GHA hardware designs.
In this research work, experiments are performed to evaluate our embedded hardware and embedded software architectures on five different benchmark datasets: Wine [71], Parkinson’s Disease [72], Heart Disease [73], Liver Disease [74], and Breast Cancer [75]. These input datasets are initially stored in the off-chip DDR4-SDRAM memory, and then forwarded to the hardware modules/IPs using the AXI bus. We use the AXI-Timer to measure the execution times of both versions of our embedded hardware IPs and embedded software module/architecture.
All our experiments (for both the embedded hardware and software) are repeated 10 times not only to ensure the reliability of the results, but also to average the results. For initialization of the weights for the GHA, we utilize random initialization using uniform distribution, similar to [80,81]. Also, we perform Design Space Exploration (DSE) to determine the optimal values for the hyperparameters of the GHA. From our DSE, we select the following optimal values for our hyperparameters: learning rate as 0.01 and number of training epochs as 100. These values give stable learning and good performance.
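A uniform random initialization of the weight matrix can be sketched as follows. The text above states only that a uniform distribution is used; the symmetric range, the use of the C library's `rand()`, the explicit seed, and the function name are all assumptions made for this illustration.

```c
#include <stdlib.h>

/* Fill `count` weights with uniform random values in [-scale, +scale].
 * The range and the random number generator are assumptions of this
 * sketch; a fixed seed makes repeated runs reproducible. */
void init_weights_uniform(float *c, int count, float scale, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < count; i++) {
        float u = (float)rand() / (float)RAND_MAX;  /* u in [0, 1] */
        c[i] = (2.0f * u - 1.0f) * scale;
    }
}
```

Using a different seed for each of the repeated runs would give a different starting point each time, which is one way to check that averaged results do not depend on a particular initialization.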

5.1. Analysis on Classification Accuracy, Convergence Rate, and Error Rate: Embedded Hardware GHA

In this subsection, we present our experimental results and analysis for our embedded hardware GHA architectures, in terms of classification accuracy, convergence rate, and error rate.

5.1.1. Performance Comparison: Classification Accuracy

As stated in [82], the classification accuracy is a performance metric that reflects the amount of errors contained in a classified data set and indicates the classification’s fitness on the features extracted by a specific algorithm, in our case, the GHA. Classification accuracy serves as an indicator of how effectively the algorithm captures meaningful and discriminative representations from the raw input data and is commonly used to evaluate the performance of the unsupervised learning approaches [82].
To obtain classification accuracy, we partition the datasets into two sets: training and testing. The test set is considered as a percentage of the dataset to investigate the classification accuracy and the error rate. Then, we perform three sets of experiments: (1) firstly, with 70% of data for the training set, and with 30% of data for the testing set; (2) secondly, with 30% of data for the training set, and with 70% of data for the testing set; (3) thirdly, with 50% of data for the training set, and with 50% of data for the testing set. These experimental results demonstrate that the second set (with only 30% of data for the training set) still achieves high classification accuracy.
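The sizes of the two partitions follow directly from the chosen percentage. The sketch below computes them under the assumption that the training share is rounded down and the remaining instances form the test set; the rounding convention and the function names are illustrative assumptions.

```c
/* Number of training instances for a given training percentage,
 * rounded down (the rounding convention is an assumption). */
int train_count(int total, int train_percent)
{
    return (total * train_percent) / 100;
}

/* The remaining instances form the test set. */
int test_count(int total, int train_percent)
{
    return total - train_count(total, train_percent);
}
```

Under this convention, the 178-instance Wine dataset with a 30% training share yields 53 training vectors and 125 test vectors.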
Therefore, in this paper, we have decided to report the classification accuracy results with the second set (i.e., 30% of data for the training set, and with 70% of data for the testing set) in Table 3 (column 2). In this case, both our embedded hardware GHA architectures achieve the same classification accuracy results as our embedded software GHA architecture. The identical classification accuracy results between the embedded hardware and embedded software GHA designs can be attributed to the utilization of single-precision FPUs for all our embedded hardware/software modules/IPs.
In Table 3 (column 2), we report the classification accuracy results for our proposed embedded FPGA-HLS-based GHA architectures for the memory-mapped interface-based one as well as for the streaming interface-based one. As depicted, the classification accuracy results are the same for both versions. From Table 3 (column 2), it is observed that classification accuracy is quite high for four datasets (i.e., Wine, Parkinson’s Disease, Heart Disease, and Liver Disease), whereas classification accuracy is significantly lower for the Breast Cancer dataset. Our intention in selecting a diverse range of datasets (i.e., five different datasets, with varying data sizes and varying traits) is to demonstrate the suitability (or unsuitability) of the GHA for different applications/fields corresponding to different datasets. These results indeed demonstrate that the classification accuracy, generated from our embedded hardware and software GHA architectures, varies based on the datasets and the corresponding applications/fields.

5.1.2. Performance Comparison: Convergence Rate and Error Rate

As stated in [83], the convergence rate refers to how quickly the algorithm’s weights approach the principal subspace of the input data. The convergence rate primarily depends on the learning rate and the eigenvalue separation of the data covariance matrix [83]. With an optimized learning rate, the Generalized Hebbian Learning (GHL) algorithm converges linearly near the optimal solution, efficiently extracting the GHA components [28,83]. As stated in [84], “error rate refers to a measure of the degree of prediction error of a model made with respect to the true model”. It is typically the complement of the classification accuracy. In this research work, the convergence rate is measured using Equation (6) below [83,85]:
$$\text{Convergence Rate}\ (\%) = \left\{ \frac{\lVert C(opt) - C(t) \rVert_F}{\lVert C(opt) - C(t-1) \rVert_F} \right\} \times 100 \qquad (6)$$
where C(opt) is the optimal weight matrix, which corresponds to the eigenvectors of the input correlation matrix, C(t) is the weight matrix at iteration t [28,85], and the subscript F denotes the Frobenius norm; both matrices are derived from Algorithm 1.
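As an illustration of how Equation (6) can be evaluated in software, the sketch below reads the equation as the ratio of Frobenius-norm distances to C(opt) at two consecutive iterations, multiplied by 100. That reading and the function name `convergence_rate_percent` are assumptions; the snippet is not taken from the authors' implementation.

```python
import numpy as np

def convergence_rate_percent(C_opt, C_t, C_prev):
    """Ratio of Frobenius distances to C_opt at iterations t and t-1, in percent."""
    numerator = np.linalg.norm(C_opt - C_t, ord="fro")
    denominator = np.linalg.norm(C_opt - C_prev, ord="fro")
    return numerator / denominator * 100.0

# Usage with small placeholder matrices (hypothetical values).
C_opt_demo = np.eye(2)
C_prev_demo = np.zeros((2, 2))
C_t_demo = 0.5 * np.eye(2)
rate_demo = convergence_rate_percent(C_opt_demo, C_t_demo, C_prev_demo)
```

In this toy case the distance to the optimum halves between the two iterations, so the ratio evaluates to 50%.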
Similar to our classification accuracy results, the convergence rate and error rate reported in this paper (in Table 3, columns 3 and 4, respectively) are also obtained with the second set (i.e., 30% of data for the training set, and with 70% of data for the testing set). In this case also, both our embedded hardware GHA architectures achieve the same convergence rate and error rate results as our embedded software GHA architecture.
In Table 3 (columns 3 and 4), we report the convergence rate and error rate results for our proposed embedded FPGA-HLS-based GHA architectures for the memory-mapped interface-based one as well as for the streaming interface-based one. Analogous to classification accuracy results, the convergence rate and error rate results are the same for both versions. From Table 3 (column 3), it is observed that convergence rate is quite high for all five datasets. From Table 3 (column 4), it is observed that error rate is quite low for four datasets (i.e., Wine, Parkinson’s Disease, Heart Disease, and Liver Disease), whereas error rate is significantly higher for the Breast Cancer dataset.
These results demonstrate that error rates, generated from our embedded hardware and software GHA architectures, vary based on the datasets and the corresponding applications/fields. From Table 3 (columns 2 and 4), there is a correlation between the classification accuracy and the error rate, i.e., the higher the classification accuracy, the lower the error rate, and vice versa. This is expected since the error rate is the complement of the classification accuracy. These results also demonstrate high convergence rates, for our embedded hardware and software GHA architectures, across all the datasets, indicating efficient training with these five datasets.
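The complementary relationship between the two metrics can be made explicit with a short sketch. It assumes both metrics are expressed as percentages and that predicted and true class labels are available as arrays; the classifier that produces the predictions is outside the scope of this snippet, and the helper name is hypothetical.

```python
import numpy as np

def accuracy_and_error_rate(y_true, y_pred):
    """Return (classification accuracy %, error rate %) for two label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred)) * 100.0
    return accuracy, 100.0 - accuracy  # error rate is the complement of accuracy

# Usage with placeholder labels: three of four predictions are correct.
acc_demo, err_demo = accuracy_and_error_rate([0, 1, 1, 0], [0, 1, 0, 0])
```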

5.2. Analysis on Space, Execution Time, and Speedup: Embedded Hardware GHA

In this subsection, we present our experimental results and analysis for our embedded hardware GHA architectures, in terms of hardware resource utilization (i.e., occupied space), execution time, and speedup.

5.2.1. Performance Comparison: Hardware Resource Utilization (Space)

We perform space analysis on both our memory-mapped interface-based and streaming interface-based FPGA-HLS-based GHA IPs to gain insight into the impact of varying datasets and varying data sizes, on the total on-chip hardware resources utilized for the FPGA-HLS-based designs. We utilize the same five datasets to perform our space (or resource utilization) experiments. We present the total available hardware resources on Zynq ZCU104 FPGA development board in Table 4.
After the Vivado implementation process, we obtain significant resource utilization parameters, including the number of LUTs (look-up-tables), number of on-chip BRAMs (block RAM memory), number of FF (flip-flops), and number of DSP48 slices. These experimental results for our memory-mapped interface-based FPGA-HLS-based GHA IP and streaming interface-based FPGA-HLS-based GHA IP are presented in Table 5 and Table 6, respectively, for all five datasets. The percentage from the total available hardware resources (from Table 4) is also included in Table 5 and Table 6.
From Table 5 and Table 6, it is observed that the Wine dataset occupies the highest number of LUTs, FFs, and DSP48 slices for both versions of our FPGA-HLS-based GHA IPs, whereas the Parkinson’s Disease dataset occupies the fewest LUTs, FFs, and DSP48 slices for both versions. These experimental results and analysis show that the occupied space (i.e., utilized hardware resources) for HLS-based GHA hardware architectures is impacted by the varying datasets and the corresponding applications/fields.
With all five datasets, it is also observed that the streaming interface-based GHA hardware IP (in Table 6) utilizes slightly fewer hardware resources than the memory-mapped interface-based GHA hardware IP (in Table 5), especially in terms of LUTs and FFs, whereas the DSP48 slice usage is the same for both versions, and BRAM usage is slightly higher for the streaming interface-based one. This improved hardware resource efficiency (in terms of LUTs and FFs) can be attributed to the streaming interface-based design’s use of pipelining, resource sharing, and continuous dataflow, which optimizes hardware resource utilization and reduces hardware overhead. The identical DSP48 slice usage can be attributed to the dependency of DSP48 slices on the GHA computations, rather than on the data transfer methods employed. This makes the streaming interface-based FPGA-HLS-based GHA IPs more suitable for resource-constrained (or small footprint) embedded neuromorphic computing devices/applications.

5.2.2. Performance Comparison: Execution Time and Speedup—Embedded Hardware vs. Embedded Software

We create an embedded software architecture for the GHA to evaluate our proposed FPGA-HLS-based hardware architectures for the GHA. Our embedded software architecture for the GHA is executed on an ARM Cortex-A53 embedded microprocessor running at 1.2 GHz on the Zynq FPGA, whereas both our embedded hardware architectures for the GHA are executed on the same Zynq FPGA running at 100 MHz.
The execution times for both versions of our embedded GHA hardware and our embedded GHA software designs are obtained using the AXI-Timer running at 100 MHz on the ZCU104 FPGA development board. These execution times are measured in real-time, while our embedded designs are actually running on the chip.
For embedded hardware versus embedded software performance comparisons, we utilize the same five datasets composed of varying data sizes and varying traits. The experimental results for both versions of our embedded GHA hardware IPs and embedded GHA software are presented in Table 7. In Table 7, the execution times for embedded software for the GHA are presented in column 2, and the execution times for memory-mapped interface-based and streaming interface-based embedded hardware IPs for GHA are presented in columns 3 and 4, respectively, for all five datasets. In this case, the execution times for each dataset (for both the embedded hardware and software) are measured 10 times and the average is presented.
From Table 7 (column 6), our streaming interface-based FPGA-HLS-based GHA hardware achieves up to 51.13× speedup compared to our embedded software counterparts, and the streaming speedups vary from 30.11× to 51.13×, with varying datasets. From Table 7 (column 5), our memory-mapped interface-based FPGA-HLS-based GHA hardware achieves up to 1.26× speedup compared to our embedded software counterparts, and the memory-mapped speedups vary from 1.00× to 1.26×, with varying datasets.
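The arithmetic behind these execution-time and speedup figures can be sketched as follows. The 100 MHz timer clock and the averaging over repeated runs come from the text; the helper names and the tick counts in the usage example are hypothetical, and speedup is assumed to be the software execution time divided by the hardware execution time.

```python
AXI_TIMER_HZ = 100_000_000  # AXI-Timer clock of 100 MHz, as stated in the text

def ticks_to_seconds(ticks, clock_hz=AXI_TIMER_HZ):
    """Convert a raw timer tick count into seconds."""
    return ticks / clock_hz

def mean_execution_time(tick_counts, clock_hz=AXI_TIMER_HZ):
    """Average execution time in seconds over repeated runs."""
    return sum(ticks_to_seconds(t, clock_hz) for t in tick_counts) / len(tick_counts)

def speedup(software_seconds, hardware_seconds):
    """Speedup of the hardware design relative to the software baseline."""
    return software_seconds / hardware_seconds

# Usage with hypothetical tick counts (not measured values from Table 7).
sw_time = mean_execution_time([100_000_000, 300_000_000])
hw_time = mean_execution_time([50_000_000, 50_000_000])
ratio = speedup(sw_time, hw_time)
```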
These experimental results and analysis demonstrate that the streaming interface-based design achieves very high speedup compared to the memory-mapped interface-based design. In this case, the streaming method leverages AXI-DMA and AXI-Streaming protocol, which significantly reduces the memory access latency and execution time, thus enhancing the speedup of the overall hardware design.
The speedup results of the streaming interface-based design are promising and validate the efficiency of our proposed FPGA-HLS-based GHA hardware. The lower latency of the streaming method allows for much higher throughput, making it well-suited for real-time in situ neuromorphic computing as well as machine learning applications, where rapid processing and analysis of big data is imperative.

5.3. Analysis on Power Consumption: Embedded Hardware GHA

In this subsection, we present our experimental results and analysis for our embedded hardware GHA architectures, in terms of power consumption.

5.3.1. Performance Comparison: Dynamic and Static Power Consumption

We perform power analysis on our memory-mapped interface-based and streaming interface-based FPGA-HLS-based GHA hardware IPs to gain insight into the impact of varying datasets and varying data sizes, on the total power consumption for the FPGA-HLS-based designs. We utilize the same five datasets to perform our power consumption experiments.
For all our power experiments presented in this paper, we utilize the Xilinx/AMD Power Design Manager (PDM) [86] to estimate the power consumption of our proposed FPGA-HLS-based GHA hardware IPs, executed on the Zynq FPGA. Xilinx/AMD PDM is a new power tool designed to provide accurate and consistent power estimation capabilities for the state-of-the-art Xilinx/AMD FPGAs, including Xilinx/AMD Versal AI and UltraScale+ FPGA devices.
Apart from the total on-chip power consumption, this Xilinx/AMD PDM enables us to obtain dynamic power and static power consumption results separately for our FPGA-HLS-based GHA hardware IPs. These experimental results for our memory-mapped interface-based and streaming interface-based FPGA-HLS-based GHA hardware IPs for all five datasets are presented in Table 8, and illustrated in Figure 6 and Figure 7, respectively.
From Table 8 and Figure 6 and Figure 7, it is observed that the Breast Cancer dataset has the highest power consumption for both versions of our FPGA-HLS-based GHA IPs, whereas the Liver Disease dataset has the lowest power consumption for both versions. From Table 8 (columns 3 and 6), the static power consumption remains almost the same (i.e., 0.693–0.694 W) for both of our FPGA-HLS-based GHA IPs across all five datasets. From Table 8 (column 2), the dynamic power consumption for our memory-mapped interface-based FPGA-HLS-based GHA hardware IP varies from 2.951 to 3.018 W, with varying datasets. From Table 8 (column 5), the dynamic power consumption for our streaming interface-based FPGA-HLS-based GHA hardware IP varies from 2.854 to 2.890 W, with varying datasets. These experimental results and analysis show that the on-chip power consumption for HLS-based GHA hardware architectures is impacted by the varying datasets and the corresponding applications/fields.
With all five datasets, from Table 8 and Figure 8, it is also observed that our streaming interface-based GHA hardware IP (column 7) consumes less power than the memory-mapped interface-based GHA hardware IP (column 4), especially in terms of dynamic power consumption. This increase in power efficiency can be attributed to the streaming interface-based design’s optimized hardware resource utilization, optimized and continuous dataflow, and elimination of register access overhead. This makes our streaming interface-based FPGA-HLS-based GHA hardware IP more suitable for power-constrained embedded and portable systems, including embedded neuromorphic computing platforms/applications.

5.3.2. Performance Comparison: Power Consumption Breakdown—PS vs. PL

Next, we distinguish the power consumption breakdown results between the Processing System (PS) versus the Programmable Logic (PL) for both our memory-mapped interface-based and streaming interface-based FPGA-HLS-based GHA hardware IPs. In this case, we utilize only the Wine dataset to demonstrate and distinguish the power consumption breakdown results.
These power consumption breakdown results are presented in Figure 9 and Figure 10, for memory-mapped interface-based and streaming interface-based FPGA-HLS-based GHA hardware IPs, respectively. From Figure 9, it is observed that PS consumes 2.643 W out of 2.962 W of dynamic power for the memory-mapped interface-based design, i.e., 89.23% of dynamic power consumption. From Figure 10, it is observed that PS consumes 2.639 W out of 2.860 W of dynamic power for the streaming interface-based design, i.e., 92.27% of dynamic power consumption. Considering the overall on-chip power consumption (from Figure 9 and Figure 10), PS consumes 75.02% and 77.08% of total on-chip power consumption, for the memory-mapped interface-based design and the streaming interface-based design, respectively. Similar results are obtained with other datasets.
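The PS share of dynamic power quoted above can be checked with a few lines of arithmetic. The wattages are the values reported in the text for the Wine dataset; the helper name `share_percent` is hypothetical.

```python
def share_percent(part_w, total_w):
    """Percentage of total_w (in watts) contributed by part_w."""
    return part_w / total_w * 100.0

# PS dynamic power over total dynamic power, Wine dataset (values from the text).
mm_ps_share = share_percent(2.643, 2.962)  # memory-mapped design, about 89.23%
st_ps_share = share_percent(2.639, 2.860)  # streaming design, about 92.27%

print(f"Memory-mapped: PS share of dynamic power = {mm_ps_share:.2f}%")
print(f"Streaming:     PS share of dynamic power = {st_ps_share:.2f}%")
```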
These experimental results and analysis show that the majority of the on-chip power consumption is due to the PS and not due to the PL. This can be attributed to the high operating frequency (i.e., 1.2 GHz) of the ARM Cortex-A53 microprocessor in the PS. These results illustrate that in real-world scenarios, we can compose more power-efficient embedded neuromorphic computing devices by utilizing only the FPGAs (i.e., PL) and removing high-performance embedded microprocessors from our designs.

6. Conclusions and Future Work

In this paper, we introduced a novel and efficient FPGA-HLS-based GHA hardware accelerator for neuromorphic computing platforms/applications. We created two different hardware versions for FPGA-HLS-based GHA: one using a memory-mapped interface, another using a streaming interface. For both hardware versions, we utilized several available HLS-based hardware optimization techniques to enhance the performance metrics, including speedup and area-efficiency. For the streaming interface-based hardware version in particular, we utilized the HLS streaming library to further enhance the performance metrics, notably speedup, of our GHA hardware architecture.
We performed experiments to evaluate the feasibility and efficiency of both versions of our proposed FPGA-HLS-based GHA hardware accelerators in terms of area, timing, speedup, classification accuracy, error rate, convergence rate, and power. All our experimental results were obtained while our embedded hardware/software designs were actually running on the chip. For our experiments, we utilized five benchmark datasets from different fields, with varying data sizes and varying number of attributes to evaluate the above performance metrics as well as the scalability metric. The scalability metric would determine our proposed GHA hardware accelerator’s ability to adapt to different datasets in many neuromorphic computing applications.
From our experimental results and analysis, considering the execution time and speedup, our streaming interface-based FPGA-HLS-based GHA IP’s speedup varied from 31.11× to 51.13× compared to our embedded software counterparts with varying datasets; whereas our memory-mapped interface-based FPGA-HLS-based GHA IP’s speedup varied from 0.46× to 2.30× compared to our embedded software counterparts with varying datasets. In these cases, all our hardware designs were executed on the Zynq FPGA running at 100 MHz, whereas our embedded software designs were executed on the ARM Cortex-A53 embedded processor running at 1.2 GHz on the same FPGA. These results demonstrate that speedup increases dramatically with the streaming interface-based hardware version. These speedup results, with the streaming interface-based hardware version, are promising, and justify the efficiency of our proposed FPGA-HLS-based GHA hardware IPs for neuromorphic computing fields/applications.
Considering the occupied area (or hardware resource utilization), our streaming interface-based FPGA-HLS-based GHA hardware IP occupied less area on chip compared to our memory-mapped interface-based GHA hardware IP, especially in terms of LUTs and FFs, whereas DSP48 slice usage was the same for both versions, and BRAM usage was slightly higher for the streaming interface-based one. Considering the power consumption, our streaming interface-based FPGA-HLS-based GHA IP consumed less power than the memory-mapped interface-based FPGA-HLS-based GHA hardware IP, especially in terms of dynamic power consumption. The space and power results also showed that the occupied space and consumed power for HLS-based GHA hardware IPs are impacted by the varying datasets and the corresponding applications/fields. Considering the classification accuracy, error rate, and convergence rate, both versions of our proposed FPGA-HLS-based GHA IPs achieved the same results as our embedded software architecture, which demonstrated that these results were not impacted by our hardware design decisions. These results make the streaming interface-based FPGA-HLS-based GHA IPs more suitable for resource-constrained and power-constrained embedded and portable systems, including embedded neuromorphic computing platforms/applications.
These experimental results are encouraging and illustrate great potential in utilizing FPGA-based architectures to support neuromorphic computing applications, especially with the resource-constrained neuromorphic devices/platforms. From our investigation on the existing works and to the best of our knowledge, no similar work exists in the published literature that provides an FPGA-HLS-based hardware accelerator for the GHA for neuromorphic computing applications/platforms, highlighting the uniqueness of our contribution.
As future work, we are planning to investigate and integrate dynamic and partial reconfiguration techniques [40,87] to create dynamically reconfigurable architectures (similar to [88,89,90]). These would enable real-time adaptation to varying workloads and data characteristics, further enhance the adaptability of our proposed FPGA-HLS-based GHA hardware IP, and better capture the biological dynamics necessary for neuromorphic computing platforms/applications. We will also explore and compose parallel computing architectures, such as [91], for our GHA hardware IP to further improve the speedup, while considering the speed–space tradeoffs of resource-constrained neuromorphic computing platforms. To facilitate this, we will integrate multi-ported memory designs, such as [92], into potential parallel computing hardware architectures, which would significantly improve the speed-performance. In addition, we are planning to evaluate our proposed HLS-based GHA architecture with much larger datasets (with continuous, temporal, or event-driven traits); to investigate and analyze the low classification accuracy results for the Breast Cancer dataset; to explore the feasibility of various hardware optimization techniques, such as low-precision fixed-point or customized quantization, to further enhance the area-efficiency and power-efficiency of our proposed GHA hardware accelerator; to investigate and analyze alternative hardware accelerators such as in [93,94]; and to explore and compute other performance metrics, including normalized metrics such as operations per second or energy per operation.

Author Contributions

Conceptualization, S.S. and D.G.P.; Data curation, S.S.; Formal analysis, S.S. and D.G.P.; Funding acquisition, D.G.P.; Investigation, S.S.; Methodology, S.S.; Project administration, D.G.P.; Resources, D.G.P.; Software, S.S.; Supervision, D.G.P.; Validation, S.S.; Visualization, S.S.; Writing—original draft, S.S.; Writing—review and editing, D.G.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Schuman, C.D.; Potok, T.E.; Patton, R.M.; Birdwell, J.D.; Dean, M.E.; Rose, G.S.; Plank, J.S. A survey of neuromorphic computing and neural networks in hardware. arXiv 2017, arXiv:1705.06963.
  2. Kyle, D. Neuromorphic Computing: The Future of AI. Report no. 1663; Los Alamos National Laboratory: Santa Fe, NM, USA, 2025. Available online: https://www.lanl.gov/media/publications/1663/1269-neuromorphic-computing (accessed on 8 January 2026).
  3. Merolla, P.A.; Arthur, J.V.; Alvarez-Icaza, R.; Cassidy, A.S.; Sawada, J.; Akopyan, F.; Jackson, B.L.; Imam, N.; Guo, C.; Modha, D.S.; et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 2014, 345, 668–673.
  4. Davies, M.; Wild, A.; Orchard, G.; Sandamirskaya, Y.; Guerra, G.A.F.; Joshi, P.; Plank, P.; Risbud, S.R. Advancing Neuromorphic Computing with Loihi: A Survey of Results and Outlook; IEEE: New York, NY, USA, 2021; Volume 109, pp. 911–934.
  5. Ahmadvand, R.; Sharif, S.S.; Banad, Y.M. Neuromorphic Digital-Twin-based Controller for Indoor Multi-UAV Systems Deployment. IEEE J. Indoor Seamless Position. Navig. 2025, 3, 165–174.
  6. Sanyal, S.; Joshi, A.; Nagaraj, M.; Manna, R.K.; Roy, K. Energy-Efficient Autonomous Aerial Navigation with Dynamic Vision Sensors: A Physics-Guided Neuromorphic Approach. arXiv 2025, arXiv:2502.05938.
  7. Pan, Y.; Jiang, H.; Chen, J.; Li, Y.; Zhao, H.; Zhou, Y.; Liu, T. Eg-spikeformer: Eye-gaze guided transformer on spiking neural networks for medical image analysis. In Proceedings of the 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI), Houston, TX, USA, 14–17 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–5.
  8. Srivastava, A.; Parmar, V.; Patel, S.; Chaturvedi, A. Adaptive cyber defense: Leveraging neuromorphic computing for advanced threat detection and response. In Proceedings of the 2023 IEEE International Conference on Sustainable Computing and Smart Systems (ICSCSS), Coimbatore, India, 14–16 June 2023; IEEE: New York, NY, USA, 2023; pp. 1557–1562.
  9. Lu, S.; Xiao, X. Neuromorphic computing for smart agriculture. Agriculture 2024, 14, 1977.
  10. Bersuker, G.; Mason, M.; Jones, K.L. Neuromorphic Computing: The Potential for High-Performance Processing in Space; Game Changer, Center for Space Policy and Strategy: Crystal City, VA, USA, 2018; pp. 1–12. Available online: https://aerospace.org/sites/default/files/2018-11/Bersuker_NeuromorphicComputing_11212018.pdf (accessed on 16 January 2026).
  11. Allied Market Research. Neuromorphic Computing Market Outlook 2030; SE: Electronic Systems and Devices; Report Code: A13743; Allied Market Research: Portland, OR, USA, 2021; Available online: https://www.alliedmarketresearch.com/neuromorphic-computing-market-A13743 (accessed on 11 January 2026).
  12. Soori, M.; Arezoo, B.; Dastres, R. Artificial intelligence, machine learning and deep learning in advanced robotics, a review. Cogn. Robot. 2023, 3, 54–70.
  13. LaBoone, P.A.; Marques, O. Overview of the future impact of wearables and artificial intelligence in healthcare workflows and technology. Int. J. Inf. Manag. Data Insights 2024, 4, 100294.
  14. Salendra, S.; Krishna, B.; Chaitanya, J.; Aluvalu, R.; Reddy, K.S. Smart Cities and AI: Real-Time Traffic Management and Energy Optimization with IoT Integration. In Proceedings of the 2025 IEEE 3rd International Conference on Data Science and Information System (ICDSIS), Hassan, India, 16–17 May 2025; IEEE: New York, NY, USA, 2025; pp. 1–6.
  15. Nguyen, T.V.; Wang, M.; Maisto, D. Addressing large scale computing challenges in neuroscience: Current advances and future directions. Front. Neuroinform. 2024, 18, 1534396.
  16. Kudithipudi, D.; Schuman, C.; Vineyard, C.M.; Pandit, T.; Merkel, C.; Kubendran, R.; Furber, S. Neuromorphic computing at scale. Nature 2025, 637, 801–812.
  17. Mead, C. Neuromorphic electronic systems. Proc. IEEE 1990, 78, 1629–1636.
  18. Indiveri, G.; Linares-Barranco, B.; Hamilton, T.J.; Schaik, A.V.; Etienne-Cummings, R.; Delbruck, T.; Boahen, K. Neuromorphic silicon neuron circuits. Front. Neurosci. 2011, 5, 73.
  19. Indiveri, G.; Liu, S.C. Memory and information processing in neuromorphic systems. Proc. IEEE 2015, 103, 1379–1397.
  20. Sun, J.; Sun, J.; Li, X.; Sun, Y.; Hong, Q.; Wang, C. A Review of Recent Developments in Neuromorphic Computing Based on Emerging Memory Devices. Nonlinear Dyn. 2025, 113, 33035–33061.
  21. Schuman, C.D.; Kulkarni, S.R.; Parsa, M.; Mitchell, J.P.; Date, P.; Kay, B. Opportunities for neuromorphic computing algorithms and applications. Nat. Comput. Sci. 2022, 2, 10–19.
  22. AbuHamra, N.; Khan, M.U.; Hassan, E.; Qutayri, M.A.; Mohammad, B. Neuromorphic Computing with Memcapacitors: Advancements, Challenges, and Future Directions. Adv. Electron. Mater. 2025, 11, e00250.
  23. Roy, K.; Jaiswal, A.; Panda, P. Towards spike-based machine intelligence with neuromorphic computing. Nature 2019, 575, 607–617.
  24. Pawlak, W.A.; Howard, N. Neuromorphic algorithms for brain implants: A review. Front. Neurosci. 2025, 19, 1570104.
  25. Guo, W.; Fouda, M.E.; Eltawil, A.M.; Salama, K.N. Efficient training of spiking neural networks with temporally-truncated local backpropagation through time. Front. Neurosci. 2023, 17, 1047008.
  26. Javanshir, A.; Nguyen, T.T.; Mahmud, M.P.; Kouzani, A.Z. Advancements in algorithms and neuromorphic hardware for spiking neural networks. Neural Comput. 2022, 34, 1289–1328.
  27. Yamazaki, K.; Vo-Ho, V.K.; Bulsara, D.; Le, N. Spiking neural networks and their applications: A review. Brain Sci. 2022, 12, 863.
  28. Sanger, T.D. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Netw. 1989, 2, 459–473.
  29. Munakata, Y.; Pfaffly, J. Hebbian learning Algorithm and development. Dev. Sci. 2004, 7, 141–148.
  30. Rumelhart, D.E.; McClelland, J.L.; PDP Research Group. Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations; The MIT Press: Cambridge, MA, USA, 1986.
  31. Kulkarni, S.R.; Parsa, M.; Mitchell, J.P.; Schuman, C.D. Benchmarking the performance of neuromorphic and spiking neural network simulators. Neurocomputing 2021, 447, 145–160.
  32. Finkbeiner, J.; Gmeinder, T.; Pupilli, M.; Titterton, A.; Neftci, E. Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models; AAAI Press: Washington, DC, USA, 2024; Volume 38, pp. 11996–12005.
  33. Shrestha, A.; Fang, H.; Mei, Z.; Rider, D.P.; Wu, Q.; Qiu, Q. A survey on neuromorphic computing: Models and hardware. IEEE Circuits Syst. Mag. 2022, 22, 6–35.
  34. Lin, S.J.; Hwang, W.J.; Lee, W.H. FPGA implementation of Generalized Hebbian Algorithm for texture classification. Sensors 2012, 12, 6244–6268.
  35. Mohsin, M.A.; Perera, D.G. High-Level Synthesis Based FPGA Hardware Architecture for PCA+SVM for Real-Time Processing on Edge Computing Platforms. IEEE Access 2025, 13, 214835–214860.
  36. Madsen, A.K.; Perera, D.G. Towards Composing Efficient FPGA-Based Hardware Accelerators for Physics-Based Model Predictive Control Smart Sensor for HEV Battery Cell Management. IEEE Access 2023, 11, 106141–106171.
  37. Miró, J.P.; Mohsin, M.A.; Alkamil, A.; Perera, D.G. FPGA-based Hardware Accelerator for Bottleneck Residual Blocks of MobileNetV2 Convolutional Neural Networks. In Proceedings of the IEEE Mid-West Symposium on Circuits and Systems (MWCAS’25), Lansing, MI, USA, 10–13 August 2025; IEEE: New York, NY, USA, 2025.
  38. Garcia, L.H.; Alkamil, A.; Mohsin, M.A.; Menzel, J.; Perera, D.G. FPGA-Based Hardware Architecture for Sequence Alignment by Genetic Algorithm. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’25), London, UK, 25–28 May 2025; IEEE: New York, NY, USA, 2025.
  39. Badhoutiya, A.; Jaffer, Z.; Hussein, H.M.; Juyal, A.; Mittal, M.; Anand, R. Field Programmable Gate Array: An Extensive Review, Recent Trends, Challenges and Applications. In Proceedings of the 2024 11th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 28 February–1 March 2024; IEEE: New York, NY, USA, 2024; pp. 1084–1090.
  40. Perera, D.G. Analysis of FPGA-Based Reconfiguration Methods for Mobile and Embedded Applications. In Proceedings of the 12th ACM FPGAWorld International Conference, (FPGAWorld’15); Association for Computing Machinery: Stockholm, Sweden, 2015; pp. 15–20.
  41. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015.
  42. Mittal, S. A Survey of FPGA-Based Accelerator for Convolutional Neural Networks. Neural Comput. Appl. 2018, 32, 1109–1139.
  43. Maass, W. Networks of spiking neurons: The third generation of neural network models. Neural Netw. 1997, 10, 1659–1671.
  44. Marković, D.; Mizrahi, A.; Querlioz, D.; Grollier, J. Physics for neuromorphic computing. Nat. Rev. Phys. 2020, 2, 499–510.
  45. Mehonic, A.; Kenyon, A.J. Brain-inspired computing needs a master plan. Nature 2022, 604, 255–260.
  46. Davies, M.; Srinivasa, N.; Lin, T.H.; Chinya, G.; Cao, Y.; Choday, S.H.; Wang, H. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 2018, 38, 82–99.
  47. Furber, S.; Bogdan, P. Spinnaker: A Spiking Neural Network Architecture; John Soldatos: Athens, Greece, 2020; p. 350.
  48. Peter, R. The Human Brain and Neuromorphic Computing in Singularity 2030. Available online: https://singularity2030.ch/the-human-brain-and-neuromorphic-computing/ (accessed on 25 November 2025).
  49. Lin, J.W. Artificial neural network related to biological neuron network: A review. Adv. Stud. Med. Sci. 2017, 5, 55–62.
  50. Date, P.; Potok, T.; Schuman, C.; Kay, B. Neuromorphic computing is turing-complete. In Proceedings of the International Conference on Neuromorphic Systems 2022; Association for Computing Machinery (ACM): New York, NY, USA, 2022; pp. 1–10.
  51. Zheng, Z.; Li, Y.; Chen, J.; Zhou, P.; Chen, X.; Liu, Y. Threshold Neuron: A Brain-inspired Artificial Neuron for Efficient On-device Inference. arXiv 2024, arXiv:2412.13902.
  52. Ali, A.H.; Alhayali, R.A.I.; Mohammed, M.A.; Sutikno, T. An effective classification approach for big data with parallel generalized Hebbian algorithm. Bull. Electr. Eng. Inform. 2021, 10, 3393–3402.
  53. Gorrell, G. Generalized Hebbian algorithm for incremental singular value decomposition in natural language processing. In 11th Conference of the European Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Trento, Italy, 2006; pp. 97–104.
  54. Rossmann, M.; Jost, T.; Goser, K.; Bühlmeier, A.; Manteuffel, G. Exponential Hebbian on-line learning implemented in FPGAs. In International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 1996; pp. 767–772.
  55. Lin, S.J.; Hung, Y.T.; Hwang, W.J. Efficient hardware architecture based on generalized Hebbian algorithm for texture classification. Neurocomputing 2011, 74, 3248–3256.
  56. Lin, S.J.; Hung, Y.T.; Hwang, W.J. Fast principal component analysis based on hardware architecture of generalized hebbian algorithm. In International Symposium on Intelligence Computation and Applications; Springer: Berlin/Heidelberg, Germany, 2010; pp. 505–515. [Google Scholar]
  57. Yu, B.; Mak, T.; Li, X.; Xia, F.; Yakovlev, A.; Sun, Y.; Poon, C.S. Real-time FPGA-based multichannel spike sorting using hebbian eigenfilters. IEEE J. Emerg. Sel. Top. Circuits Syst. 2012, 1, 502–515. [Google Scholar] [CrossRef]
  58. Chen, Y.L.; Hwang, W.J.; Ke, C.E. Efficient VLSI Architecture for Spike Sorting Based on Generalized Hebbian Algorithm. In ESANN 2013 Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 24–26 April 2013; i6doc.com: Bruges, Belgium, 2013; ISBN 978-2-87419-081-0. Available online: https://i6doc.com/en/book/?GCOI=28001100131010 (accessed on 16 January 2026).
  59. Hwang, W.J.; Ke, C.E.; Lai, S.Y.; Wu, J.G. Efficient FPGA-Based Architecture for Spike Sorting Using Generalized Hebbian Algorithm. In Proceedings of the 9th International Conference on Systems (ICONS); IARIA: Wilmington, DE, USA, 2014; Available online: https://personales.upv.es/thinkmind/dl/conferences/icons/icons_2014/icons_2014_1_20_40062.pdf (accessed on 16 January 2026).
  60. Chen, Y.L.; Hwang, W.J.; Ke, C.E. An efficient VLSI architecture for multi-channel spike sorting using a generalized Hebbian algorithm. Sensors 2015, 15, 19830–19851. [Google Scholar] [CrossRef] [PubMed]
  61. Sharma, S.; Perera, D.G. Analysis of Generalized Hebbian Learning Algorithm for Neuromorphic Hardware Using Spinnaker. arXiv 2024, arXiv:2411.11575. [Google Scholar] [CrossRef]
  62. AMD. ZCU104 Evaluation Board User Guide (UG1267) (v 1.1). 2018. Available online: https://docs.amd.com/v/u/en-US/ug1267-zcu104-eval-bd (accessed on 20 November 2025).
  63. AMD. Zynq UltraScale+ MPSoC Data Sheet: Overview (DS891) (v 1.11.1). 2025. Available online: http://docs.amd.com/v/u/en-US/ds891-zynq-ultrascale-plus-overview (accessed on 20 November 2025).
  64. Xilinx Inc. AXI Reference Guide (UG761) (v14.3). 2012. Available online: https://www.xilinx.com/support/documents/ip_documentation/axi_ref_guide/latest/ug761_axi_reference_guide.pdf (accessed on 20 November 2025).
  65. Xilinx Inc. Vivado Design Suite User Guide: High-Level Synthesis (UG902) (v2021.1). 2021. Available online: https://www.amd.com/content/dam/xilinx/support/documents/sw_manuals/xilinx2020_2/ug902-vivado-high-level-synthesis.pdf (accessed on 20 November 2025).
  66. AMD. Vitis Tutorials: Hardware Acceleration (XD099) (v2025.1). 2025. Available online: https://docs.amd.com/r/en-US/Vitis-Tutorials-Vitis-Hardware-Acceleration/The-Vitis-Flow (accessed on 20 November 2025).
  67. AMD. Vitis Unified Software Platform Documentation: Embedded Software Development (UG1400) (v2023.2). 2023. Available online: https://docs.amd.com/r/2023.2-English/ug1400-vitis-embedded/Performing-Standalone-Application-Debug (accessed on 20 November 2025).
  68. Shanika Sithumini, W. PL and PS Interconnection, in Medium. 2025. Available online: https://medium.com/@e19380/pl-and-ps-interconnection-b69d95c2f198 (accessed on 16 January 2026).
  69. Xilinx. AXI Timer v2.0 LogiCORE IP Product Guide (PG079). 5 October 2016. Available online: https://docs.xilinx.com/r/en-US/pg079-axi-timer (accessed on 20 November 2025).
  70. Kelly, M.; Longjohn, R.; Nottingham, K. The UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu (accessed on 20 November 2025).
  71. Aeberhard, S.; Forina, M. Wine—UCI Machine Learning Repository. 1991. Available online: https://archive.ics.uci.edu/dataset/109/wine (accessed on 10 October 2025).
  72. Max, A.; Little, P.; McSharry, S.; Costello, D.; Moroz, I. Parkinson—UCI Machine Learning Repository. 2007. Available online: https://archive.ics.uci.edu/dataset/174/parkinsons (accessed on 10 October 2025).
  73. Detrano, R.; Jánosi, A.; Steinbrunn, W.; Pfisterer, M.; Schmid, J.; Sandhu, S.; Guppy, K.; Lee, S.; Froelicher, V. Heart Disease—UCI Machine Learning Repository. 1989. Available online: https://archive.ics.uci.edu/dataset/45/heart+disease (accessed on 10 October 2025).
  74. She, Q.; Chen, K.; Ma, Y.; Nguyen, T.; Zhang, Y. Liver Disorders—UCI Machine Learning Repository. 2018. Available online: https://archive.ics.uci.edu/dataset/60/liver+disorders (accessed on 10 October 2025).
  75. Zhang, X.; Zhu, X.; Lessard, L. Breast Cancer—UCI Machine Learning Repository. 2019. Available online: https://archive.ics.uci.edu/dataset/14/breast+cancer (accessed on 10 October 2025).
  76. Kulkarni, S.; Parsa, M.; Mitchell, J.P.; Schuman, C. Training spiking neural networks with synaptic plasticity under integer representation. In International Conference on Neuromorphic Systems 2021; Association for Computing Machinery (ACM): New York, NY, USA, 2021; pp. 1–7.
  77. Javanshir, A.; Nguyen, T.T.; Mahmud, M.P.; Kouzani, A.Z. Training spiking neural networks with metaheuristic algorithms. Appl. Sci. 2023, 13, 4809.
  78. Cong, J.; Liu, B.; Neuendorffer, S.; Noguera, J.; Vissers, K.; Zhang, Z. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2011, 30, 473–491.
  79. AMD. Vitis High-Level Synthesis User Guide (UG1399) (v2022.2). 2022. Available online: https://docs.amd.com/r/2022.2-English/ug1399-vitis-hls/Introduction (accessed on 16 January 2026).
  80. Nguyen, D.; Widrow, B. Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights. In 1990 IJCNN International Joint Conference on Neural Networks; IEEE: New York, NY, USA, 1990; pp. 21–26.
  81. Wei, Y.; Li, X.; Lang, F.; Wang, Y.; Ma, T. Analysis of random initialization methods in machine learning. In Proceedings of the 2024 Guangdong-Hong Kong-Macao Greater Bay Area International Conference on Digital Economy and Artificial Intelligence; Association for Computing Machinery (ACM): New York, NY, USA, 2024; pp. 405–409.
  82. Foody, G.M. Challenges in the real world use of classification accuracy metrics: From recall and precision to the Matthews correlation coefficient. PLoS ONE 2023, 18, e0291908.
  83. Oja, E. Simplified neuron model as a principal component analyzer. J. Math. Biol. 1982, 15, 267–273.
  84. Ting, K.M. Error Rate. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2011.
  85. He, J.; Lin, G. Average convergence rate of evolutionary algorithms. IEEE Trans. Evol. Comput. 2015, 20, 316–321.
  86. AMD. Power Design Manager 2025. Available online: https://www.amd.com/en/products/software/adaptive-socs-and-fpgas/power-design-manager.html (accessed on 20 October 2025).
  87. Perera, D.G.; Li, K.F. A design methodology for mobile and embedded applications on FPGA-based dynamic reconfigurable hardware. Int. J. Embed. Syst. 2019, 11, 661–677.
  88. Alkamil, A.; Perera, D.G. Towards dynamic and partial reconfigurable hardware architectures for cryptographic algorithms on embedded devices. IEEE Access 2020, 8, 221720–221742.
  89. Shahrouzi, S.N.; Perera, D.G. Dynamic partial reconfigurable hardware architecture for principal component analysis on mobile and embedded devices. Eurasip J. Embed. Syst. 2017, 2017, 25.
  90. Perera, D.G.; Li, K.F. FPGA-based reconfigurable hardware for compute intensive data mining applications. In Proceedings of the 2011 International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, Barcelona, Spain, 26–28 October 2011; IEEE: New York, NY, USA, 2011; pp. 100–108.
  91. Perera, D.G.; Li, K.F. Parallel Computation of Similarity Measures Using an FPGA-Based Processor Array. In Proceedings of the 22nd IEEE International Conference on Advanced Information Networking and Applications (AINA’08), Okinawa, Japan, 25–28 March 2008; IEEE: New York, NY, USA, 2008; pp. 955–962.
  92. Shahrouzi, S.N.; Alkamil, A.; Perera, D.G. Towards Composing Optimized Bi-Directional Multi-Ported Memories for Next-Generation FPGAs. IEEE Access 2020, 8, 91531–91545.
  93. Wen, Z.; Wan, W.; Pakala, A.R.; Zou, Y.; Wei, W.C.; Yang, K. PICO-RAM: A PVT-insensitive analog compute-in-memory SRAM macro with in situ multi-bit charge computing and 6T thin-cell-compatible layout. IEEE J. Solid-State Circuits 2024, 60, 308–320.
  94. Wan, W.; Kubendran, R.; Schaefer, C.; Eryilmaz, S.B.; Zhang, W.; Wu, D.; Cauwenberghs, G. A compute-in-memory chip based on resistive random-access memory. Nature 2022, 608, 504–512.
Figure 1. Illustration of a biological neuron, revised from [48,49].
Figure 2. Our proposed system-level architecture for FPGA-HLS-based GHA hardware IP with a memory-mapped interface.
Figure 3. Our proposed system-level architecture for FPGA-HLS-based GHA hardware IP with a streaming interface.
Figure 4. High-Level Architecture for our FPGA-HLS-based GHA Hardware IP with a memory-mapped interface.
Figure 5. High-Level Architecture for our FPGA-HLS-based GHA Hardware IP with a streaming interface.
Figure 6. Total On-Chip Power, Dynamic Power and Static Power consumption for the memory-mapped interface-based FPGA-HLS-Based GHA hardware IP for varying datasets.
Figure 7. Total On-Chip Power, Dynamic Power and Static Power consumption for the streaming interface-based FPGA-HLS-Based GHA hardware IP for varying datasets.
Figure 8. Total On-Chip Power Consumption: memory-mapped vs. streaming interface-based designs for varying datasets.
Figure 9. Power Consumption Breakdown: PS vs. PL for the memory-mapped interface-based GHA IP using the Wine dataset.
Figure 10. Power Consumption Breakdown: PS vs. PL for the streaming interface-based GHA IP using the Wine dataset.
Table 1. Summary: Analysis of existing works on hardware architectures for the GHA.

| Research Work | Platform | Resource Utilization | Clk Freq. (MHz) | Power (mW) |
|---|---|---|---|---|
| In [54], 1996, Exponential Hebbian learning | Xilinx XC3090 FPGA | CLB 320 | 1.0 | Not reported |
| In [55,56], 2010, 2011, GHA-based PCA | Altera Cyclone III FPGA | LE varied 33,723–65,384 with l 3 to 7 | 85.0 | Not reported |
| In [57], 2011, Hebbian Eigenfilter Spike Sorting | Xilinx XC6S-LX150L FPGA | Slices varied 777–1113; BRAMs varied 44–65; with 10-bit to 16-bit | Not reported | Varied 6.4 to 8.6 with 10-bit to 16-bit |
| In [34], 2012, GHA-based PCA | Altera Cyclone IV and Cyclone III FPGAs | LE varied 85,271 to 121,940 with l 3 to 7 | 50.0 | 129.0 |
| In [58], 2013, GHA-based Spike Sorting | Altera Cyclone IV FPGA | LE 9144 for GHA; LE 19,734 for whole | 50.0 | Not reported |
| In [59], 2014, NoC-based GHA Spike Sorting | Altera Cyclone IV FPGA | LE 15,688; DSP 128 | 50.0 | Not reported |
| In [60], 2015, VLSI Multichannel GHA-based Spike Sorting | TSMC 90 nm Technology ASIC | Varied 214,090–658,196 μm², with L 2 to 32 | 1.0 to 2.0 | Varied 152.4 μW/ch to 85.8 μW/ch, with channels 8 to 64 |

Table 2. Summary of benchmark datasets.

| Benchmark Datasets | Field | No. of Instances | No. of Attributes | No. of Classes |
|---|---|---|---|---|
| Wine dataset [71] | Physics and Chemistry | 178 | 13 | 3 |
| Parkinson dataset [72] | Health and Medicine | 197 | 22 | 2 |
| Heart disease dataset [73] | Health and Medicine | 303 | 14 | 5 |
| Liver disease dataset [74] | Health and Medicine | 345 | 5 | 2 |
| Breast cancer dataset [75] | Health and Medicine | 286 | 9 | 2 |

Table 3. Classification Accuracy, Convergence Rate, and Error Rate: FPGA-HLS-Based GHA hardware for the memory-mapped interface and for the streaming interface.

| Dataset (Data Size) | Classification Accuracy (%) | Convergence Rate (%) | Error Rate (%) |
|---|---|---|---|
| Wine (178 × 13) | 98.88 | 98.90 | 1.12 |
| Parkinson's Disease (197 × 22) | 91.67 | 92.60 | 8.33 |
| Heart Disease (303 × 13) | 97.65 | 96.20 | 2.35 |
| Breast Cancer (286 × 9) | 11.76 | 98.60 | 88.24 |
| Liver Disease (345 × 5) | 97.65 | 89.90 | 2.35 |

Table 4. Total available hardware resources on the Zynq ZCU104 board.

| No. of LUTs | No. of FFs | No. of BRAMs | No. of DSP Slices |
|---|---|---|---|
| 230,400 | 460,800 | 312 | 1728 |

Table 5. Hardware Resource Utilization: FPGA-HLS-Based GHA hardware for the memory-mapped interface.

| Dataset (Data Size) | No. of LUTs | No. of BRAMs | No. of DSP48 Slices | No. of FFs |
|---|---|---|---|---|
| Wine (178 × 13) | 17,171 (7.45%) | 4 (1.28%) | 70 (4.05%) | 20,233 (4.39%) |
| Parkinson's Disease (197 × 22) | 6248 (2.71%) | 5 (1.60%) | 35 (2.02%) | 7737 (1.67%) |
| Heart Disease (303 × 13) | 8900 (3.86%) | 3 (0.96%) | 40 (2.31%) | 11,000 (2.38%) |
| Breast Cancer (286 × 9) | 11,500 (4.99%) | 4 (1.28%) | 50 (2.89%) | 14,500 (3.14%) |
| Liver Disease (345 × 5) | 9800 (4.25%) | 3 (0.96%) | 38 (2.19%) | 12,000 (2.60%) |

Table 6. Hardware Resource Utilization: FPGA-HLS-Based GHA hardware for the streaming interface.

| Dataset (Data Size) | No. of LUTs | No. of BRAMs | No. of DSP48 Slices | No. of FFs |
|---|---|---|---|---|
| Wine (178 × 13) | 16,182 (7.02%) | 6 (1.92%) | 70 (4.05%) | 16,487 (3.57%) |
| Parkinson's Disease (197 × 22) | 6040 (2.62%) | 6 (1.92%) | 35 (2.02%) | 7500 (1.62%) |
| Heart Disease (303 × 13) | 8550 (3.71%) | 5 (1.60%) | 40 (2.31%) | 10,200 (2.21%) |
| Breast Cancer (286 × 9) | 11,020 (4.78%) | 6 (1.92%) | 50 (2.89%) | 14,000 (3.03%) |
| Liver Disease (345 × 5) | 9503 (4.12%) | 5 (1.60%) | 38 (2.19%) | 11,200 (2.43%) |

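The percentages in Tables 5 and 6 appear to be each resource count divided by the corresponding device total in Table 4. The following is a minimal Python sketch under that assumption; the names `ZCU104_TOTALS`, `utilization_pct`, and `wine_mm` are illustrative and do not come from the paper. It recomputes the Wine row of Table 5.

```python
# Device totals copied from Table 4 (Zynq ZCU104 board).
ZCU104_TOTALS = {"LUT": 230_400, "FF": 460_800, "BRAM": 312, "DSP": 1_728}


def utilization_pct(used: int, resource: str) -> float:
    """Return used/total as a percentage for one resource type (assumed definition)."""
    return 100.0 * used / ZCU104_TOTALS[resource]


# Wine row of Table 5 (memory-mapped interface).
wine_mm = {"LUT": 17_171, "BRAM": 4, "DSP": 70, "FF": 20_233}
wine_mm_pct = {name: utilization_pct(count, name) for name, count in wine_mm.items()}

for name, pct in wine_mm_pct.items():
    print(f"{name}: {pct:.2f}%")
```

Under this assumption the computed values round to the 7.45%, 1.28%, 4.05%, and 4.39% reported for the Wine dataset. A few other rows differ from the table in the last digit, which may indicate truncation rather than rounding in the reported figures.
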
Table 7. Execution Time and Speedup: embedded software vs. embedded hardware for the memory-mapped and streaming interface-based designs.

| Dataset (Data Size) | Embedded Software Time (ms) | Embedded Hardware Time: Memory-Mapped (ms) | Embedded Hardware Time: Streaming (ms) | Speedup: Memory-Mapped | Speedup: Streaming |
|---|---|---|---|---|---|
| Wine (178 × 13) | 1.534 | 1.532 | 0.030 | 1.00 | 51.13 |
| Parkinson's Disease (197 × 22) | 2.179 | 1.900 | 0.054 | 1.14 | 40.35 |
| Heart Disease (303 × 13) | 2.375 | 2.300 | 0.055 | 1.03 | 43.18 |
| Breast Cancer (286 × 9) | 0.581 | 0.460 | 0.012 | 1.26 | 48.41 |
| Liver Disease (345 × 5) | 6.144 | 5.140 | 0.204 | 1.19 | 30.11 |

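The speedup columns of Table 7 are consistent with the usual ratio of embedded software time to embedded hardware time. The sketch below assumes that definition and recomputes both speedup columns from the reported times; `TIMES_MS`, `speedup`, and `speedups` are illustrative names, not identifiers from the paper.

```python
# Execution times in ms, copied from Table 7:
# dataset -> (embedded software, memory-mapped hardware, streaming hardware).
TIMES_MS = {
    "Wine": (1.534, 1.532, 0.030),
    "Parkinson's Disease": (2.179, 1.900, 0.054),
    "Heart Disease": (2.375, 2.300, 0.055),
    "Breast Cancer": (0.581, 0.460, 0.012),
    "Liver Disease": (6.144, 5.140, 0.204),
}


def speedup(software_ms: float, hardware_ms: float) -> float:
    """Speedup of hardware over software, assumed here to be t_sw / t_hw."""
    return software_ms / hardware_ms


# dataset -> (memory-mapped speedup, streaming speedup)
speedups = {
    name: (speedup(sw, mm), speedup(sw, st))
    for name, (sw, mm, st) in TIMES_MS.items()
}

for name, (mm, st) in speedups.items():
    print(f"{name}: memory-mapped {mm:.2f}x, streaming {st:.2f}x")
```

The recomputed ratios agree with the table to within about 0.01, which is what one would expect if the reported speedups were derived from unrounded timings.
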
Table 8. On-Chip Power Consumption: embedded hardware for the memory-mapped and for streaming interface-based designs.

| Dataset (Data Size) | Memory-Mapped: Dynamic Power (W) | Memory-Mapped: Static Power (W) | Memory-Mapped: Total On-Chip (W) | Streaming: Dynamic Power (W) | Streaming: Static Power (W) | Streaming: Total On-Chip (W) |
|---|---|---|---|---|---|---|
| Wine (178 × 13) | 2.962 | 0.693 | 3.655 | 2.860 | 0.693 | 3.552 |
| Parkinson's Disease (197 × 22) | 2.992 | 0.694 | 3.685 | 2.875 | 0.693 | 3.568 |
| Heart Disease (303 × 13) | 3.005 | 0.694 | 3.698 | 2.883 | 0.693 | 3.576 |
| Breast Cancer (286 × 9) | 3.018 | 0.694 | 3.712 | 2.890 | 0.693 | 3.583 |
| Liver Disease (345 × 5) | 2.951 | 0.693 | 3.644 | 2.854 | 0.693 | 3.547 |

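In Table 8 the total on-chip power should equal the sum of the dynamic and static components. The short Python sketch below checks that relationship on the reported values and computes the difference in total power between the two interfaces. `POWER_W`, `sum_mismatch`, `worst_gap`, and `total_gap_w` are illustrative names, and the sketch is a consistency check on the table rather than part of the authors' measurement flow.

```python
# Table 8 values in watts: dataset -> (memory-mapped, streaming),
# each given as (dynamic, static, reported total on-chip).
POWER_W = {
    "Wine": ((2.962, 0.693, 3.655), (2.860, 0.693, 3.552)),
    "Parkinson's Disease": ((2.992, 0.694, 3.685), (2.875, 0.693, 3.568)),
    "Heart Disease": ((3.005, 0.694, 3.698), (2.883, 0.693, 3.576)),
    "Breast Cancer": ((3.018, 0.694, 3.712), (2.890, 0.693, 3.583)),
    "Liver Disease": ((2.951, 0.693, 3.644), (2.854, 0.693, 3.547)),
}


def sum_mismatch(dynamic_w: float, static_w: float, total_w: float) -> float:
    """Absolute gap between dynamic + static and the reported total."""
    return abs(dynamic_w + static_w - total_w)


# Largest mismatch over every row of both interface variants.
worst_gap = max(sum_mismatch(*row) for pair in POWER_W.values() for row in pair)

# Reported total on-chip power: memory-mapped minus streaming.
total_gap_w = {name: mm[2] - st[2] for name, (mm, st) in POWER_W.items()}

print(f"largest |dynamic + static - total| = {worst_gap:.3f} W")
for name, gap in total_gap_w.items():
    print(f"{name}: streaming total is {gap:.3f} W lower")
```

The components add up to the reported totals to within 0.001 W, which is consistent with rounding to three decimals, and the streaming design reports the lower total for every dataset.
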
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sharma, S.; Perera, D.G. High-Level Synthesis-Based FPGA Hardware Accelerator for Generalized Hebbian Learning Algorithm for Neuromorphic Computing. Electronics 2026, 15, 1725. https://doi.org/10.3390/electronics15081725