FPGA-Based Implementation of Artificial Neural Network for Accelerated Handwritten Digit Recognition

Madani, Mahdi; Bourennane, El-Bay

doi:10.3390/electronics15112384

by Mahdi Madani * and El-Bay Bourennane

Université Bourgogne Europe, IMVIA UR 7535, 21000 Dijon, France

* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2384; https://doi.org/10.3390/electronics15112384

Submission received: 27 April 2026 / Revised: 26 May 2026 / Accepted: 29 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue FPGA-Based Accelerators for Deep Neural Networks)


Abstract

Many machine learning and deep learning algorithms based on Artificial Neural Networks (ANNs) have been implemented on software platforms for handwritten digit and character recognition. However, an ANN is difficult to deploy on an embedded platform based on a Central Processing Unit (CPU) because of its heavy computation, complex structure, and frequent memory access. Field Programmable Gate Array (FPGA) devices ease this task by making it possible to design fully customizable hardware architectures. They also provide high flexibility and highly parallel computation, and they contain enough on-chip Digital Signal Processing (DSP) blocks to handle costly multiplications. In this paper, we present a detailed FPGA-based implementation of a handwritten digit recognition system based on a Multi-Layer Perceptron (MLP) model. The internal modules of the network are designed using the VHSIC Hardware Description Language (VHDL) to achieve a high level of optimization on the hardware platform, and the functionality is simulated and tested using the Vivado ISIM tools. The system has been characterized and reaches acceptable performance compared to previous approaches. Implemented on a Xilinx Pynq-Z2 board, the whole neural network occupies 20,758 LUTs, 4426 FFs, 3.5 blocks of random-access memory (BRAM), and 42 DSPs. It recognizes a handwritten digit in 2.192 µs while consuming only 0.36 W, and it achieves a classification accuracy of 97%. Additionally, thanks to its regularity, the proposed architecture can be easily scaled to different FPGA devices. It is therefore portable and can be used in a range of real embedded applications.

Keywords: FPGA implementation; hardware acceleration; optimization; handwritten recognition; MLP classification; deep learning; feature extraction

1. Introduction

The growing computing needs of Artificial Intelligence (AI) algorithms cannot be met by simple CPU-based machines [1], which generally execute software sequentially and are not suitable for environments where portability is needed. In the last decade, parallel-oriented systems such as Graphics Processing Units (GPUs) [2] and FPGAs [3,4] have been used to satisfy these computation and real-time requirements. Moreover, in real applications, the whole design will be implemented on an embedded system characterized by limited memory and area, and powered by a small self-contained battery. Considering these constraints, in this work we focus on FPGA devices, which are known for their low power consumption, thousands of reprogrammable logic resources, and hundreds of DSP units useful for costly multiplications [5]. Despite the large number of studies that have dealt with handwritten character recognition, some parameters still need improvement to obtain an efficient system that is easy to implement on an area-limited embedded system and responsive enough for real-time data processing. Therefore, this study is dedicated to designing and implementing an efficient ANN algorithm [6] to recognize separated handwritten characters, as found in postal code addresses, bank cheques, computer vision, etc.

The proposed model uses a two-layer MLP method [7], which is known for its high success rate, is principally based on a backpropagation training algorithm, and is appropriate for FPGA implementation. The off-chip learning method adopted in this study is as follows. We begin by training the algorithm on the Jupyter Notebook software platform. We used the standard MNIST dataset containing 60,000 grayscale images of size 28 × 28 across ten classes to train and test the proposed network [8]. We used 80% of the dataset to train the model and 20% to test it. Once the learning is accomplished and the weights and biases are set to minimal error, we save these parameters in a Read-Only Memory (ROM) and use them as inputs to the hardware architecture designed using VHDL and implemented on a Zynq-7 FPGA device via the Pynq-Z2 development board [9,10]. Then, we simulated the architecture using the Vivado 2022.1 ISIM tools to check that the algorithm responds correctly (it must generate the expected number) when faced with unknown data (not used during the training phase) and to estimate the algorithm execution time. Finally, we optimized the implemented architecture to enhance the performance in terms of timing metrics (frequency and throughput), logic resources (area), and power consumption.

The main objective of this paper is to accelerate intelligent system performance on an FPGA platform using the off-chip learning technique. A highly nonlinear handwritten digit recognition problem was solved with the MLP training method on external software, and the extracted Neural Network (NN) parameters were used to design and implement the NN architecture on the hardware platform. To minimize the processing time required to recognize a handwritten number, we used three parallel processing stages in our FPGA implementation. The first one executes seven average pooling functions, each calculating the mean value of four pixels. The second stage executes 32 parallel multiplications to calculate the output of the hidden layer. The third one executes 10 parallel multiplications to calculate the output of the second layer. To accelerate the processing of the next input data, we connected the model layers using pipeline registers. This combination of parallel processing and pipeline techniques allowed the proposed design to achieve good timing and hardware performance in addition to low power consumption inside the FPGA device.

The remainder of this paper is organized as follows. Section 2 gives an overview of the related work. Section 3 presents the internal architecture of the MLP model proposed for handwriting recognition. Section 4 describes the design and gives the FPGA implementation and simulation results. Section 5 is devoted to performance evaluation and a comparative study with the literature. Finally, Section 6 concludes this paper and outlines some ideas for improving this work in the future.

2. Analysis of Related Work

The literature over the past two decades highlights a significant evolution in ANN approaches for digit and character recognition. While historical methodologies predominantly relied on software platforms, recent research heavily focuses on hardware-accelerated architectures to overcome real-time computational bottlenecks.

Several comprehensive reviews have synthesized these advancements. They give exhaustive overviews of handwritten text models using a Convolutional Neural Network (CNN) [11], as well as incremental recognition, line and word segmentation, slope and slant correction, and zoning techniques [12]. Other work has built Optical Character Recognition (OCR) systems on a Memory-Tree model and then proposed an FPGA implementation of the architecture [13]. These systems can be used in practical applications and help track progress in the recognition of modern and historical documents on a global scale, with a specific focus on the French language and other languages, including the datasets used in the International Conference on Frontiers of Handwriting Recognition (ICFHR) and the International Conference on Document Analysis and Recognition (ICDAR) competitions [14]. They can also be used in other real applications based on unsupervised feature learning methods, such as reading house numbers from street-level photos, and on scanned documents, typed digital documents, and handwritten documents [15]. An offline–online handwriting recognition method has been developed that recognizes characters both in written and stored documents (offline) and in real time as they are written with an electronic pen (online) [16].

Beyond theoretical surveys, a wide spectrum of research addresses handwriting recognition through diverse signal processing, machine learning, deep learning, and tree-based algorithms. Traditional frameworks frequently employ rigorous structural segmentation and pre-processing pipelines. A known workflow for typewritten Arabic character recognition converts grayscale scans into binary images and uses median filtering to eliminate isolated noise before executing character splitting and feature extraction [17,18]. For isolated Arabic character recognition, an OCR system has used the Fuzzy ART neural network to recognize the characters and implemented the system on a software platform based on a fixed-point digital signal processor (TI TMS320C6416T) [19]. Another OCR pipeline is formalized into distinct stages (acquisition, pre-processing, segmentation, feature extraction, classification, and post-processing), offering high processing speed and accuracy [20]. For Persian handwritten character recognition, an artificial neural network based on a decision tree has been presented [21]. For English characters, an ANN-based system has been developed using two different learning algorithms (Resilient Backpropagation and Scaled Conjugate Gradient) to train the model [22]. To recognize characters incrementally, one system reduces style variation by considering the slope of the text (estimated from the slope of the baseline); it concludes that ascenders and descenders do not contribute to the initial formation, so they are removed where possible [23]. To extract the features of symbols or characters, a method based on separating the processed image data into static and dynamic zones has been implemented [24]. To achieve optimal accuracy, a CNN-based method was used to recognize characters from a segmented image, along with the slope and tilt correction method with the lowest accuracy [25].

Recently, deep learning has significantly improved accuracy benchmarks. CNNs are widely leveraged to resolve complex handwriting variations on the MNIST digits dataset due to their superior capability for automatic feature extraction [25,26,27]. To maximize efficiency, hybrid architectures combine CNNs for robust feature mapping with Support Vector Machines (SVMs) acting as the final high-accuracy classifier [28]. Additionally, the literature clearly differentiates between offline recognition (analyzing stored, scanned documents) and online recognition (decoding electronic pen strokes in real time) [16], even under unsupervised feature learning constraints from noisy real-world images [15]. In terms of classification models, an ANN-based approach has extracted and classified geometrical features to automate the recognition of handwritten Arabic characters [29], a CNN model architecture has been implemented to improve performance on the MNIST handwritten digit recognition problem [26], and a combined CNN and SVM system has been built for pre-processing (using the median filtering technique), segmentation, and character recognition [28].

To meet strict real-time and high-throughput requirements, FPGAs have emerged as a dominant hardware acceleration technology. Early and lightweight hardware architectures focused on mapping binary grids or low-complexity networks onto programmable logic, including generic hardware-based ANN classifiers for small binary matrices [30], perceptron neural network hardware designs [31], and two-layer feedforward backpropagation topologies optimized for limited object sets (10) [32]. For Farsi and Persian digit recognition, MLP architectures implemented on FPGAs are extensively used to manage both feature extraction and classification layers efficiently [33]. To optimize hardware resources, specialized structural approaches have been developed. An FPGA architecture was developed for Persian optical character recognition using horizontal and vertical projections of the digits: the image is converted into a black (digit pixels) and white (background pixels) image to calculate the feature vectors for each digit [34]. For Arabic OCR, hardware–software co-design paradigms split the workload, using a software layer for image pre-processing alongside dedicated FPGA hardware units for real-time segmentation and character recognition [35]. For Arabic Optical Character Recognition (AOCR), a hardware design extracts features and detects character matching units [36]. Finally, recent trends demonstrate the feasibility of deploying deep learning directly onto hardware without excessive resource consumption. Notable designs include an FPGA-based CNN implementation using 9-bit fixed-point representations, which completely bypasses the need for costly DSP blocks while maintaining a 90% classification accuracy [37]. Similarly, heterogeneous co-designs use an Altera Cyclone IV chip and several calculation cores connected by a Deep Belief Network (DBN) via hardware programming, achieving high-speed inference for complex handwritten image datasets [38].

To clearly highlight the performance trade-offs, advantages, and limitations of these diverse methodologies, a comparative analysis is presented in Table 1.

Problem Statement and Motivation

As analyzed in Table 1, a persistent gap exists between algorithmic accuracy and hardware efficiency in handwriting recognition systems. While high-accuracy software models (such as deep CNNs or hybrid architectures) suffer from sequential execution delays on conventional processors, existing hardware accelerators often fall into two extremes: they either compromise classification accuracy by relying on overly simplified network topologies, or they demand excessive logic resources and power, limiting their viability for edge-AI deployment. This structural trade-off underscores the clear necessity for a novel, optimized acceleration paradigm.

To bridge this gap, this paper introduces a highly efficient, hardware-accelerated framework using an off-chip learning technique based on a trained MLP model. Unlike conventional approaches that incur heavy on-chip training overheads, our methodology decouples the computational learning phase from the deployment phase. The network is pre-trained on software. Once optimal convergence is achieved, the extracted weights and biases are embedded into on-chip memory, serving as static inputs for a custom-designed hardware architecture.

The core motivation of this work is to maximize processing throughput while minimizing resource utilization and power consumption. To achieve this, the proposed FPGA architecture implements a deeply optimized hardware design divided into three parallel processing stages interconnected with pipeline registers. The proposed work has been validated through Vivado simulation and implementation. The results obtained show that we were able to circumvent the resource and timing penalties of previous approaches, establishing a highly efficient solution for real-time, resource-constrained intelligent systems.

3. Architecture of Proposed Multi-Layer Perceptron Model

The multi-layer ANN is a powerful area of deep learning. Like all neural networks, it is based on fully connected multi-layer neurons, and each one is trained using Perceptron (weights updated based on a unit function, used in this work) or Adaline (weights updated based on a linear function) rules. The architecture of a single layer is shown in Figure 1 [7]. Since its emergence, the MLP network-based classifier has been widely used to classify handwritten digits.

In this work, we implemented an MLP network based on three layers, namely an input layer, one hidden layer, and an output layer, to perform handwritten digit recognition. Each neuron or node of each layer is connected to all neurons in the next layer. Figure 2 illustrates the connections of the three MLP layers in our model. We used the MNIST dataset, based on images of size $28 \times 28$ pixels, to train and test our model.

The processing begins by loading the input image (a grayscale image of size $28 \times 28$ pixels) into the input layer. This is followed by a $2 \times 2$ average pooling function applied to the input image to reduce it to an image of size $14 \times 14$, as shown in Figure 3. Then, we reshape the resulting image into a vector of size $1 \times 196$. This step transforms the image from a 2D space into a flattened 1D vector. After that, we connect this input layer (196 neurons) to the hidden layer (32 neurons) by multiplying by the weight units and adding the bias unit. This results, on the one hand, in reducing the dimensionality and, on the other hand, in extracting abstract features from the raw input data. Consequently, each neuron learns some aspects of the data, focusing on detecting the horizontal and vertical edges that draw the digit contained in the input image (the pattern corresponding to a specific digit). Finally, we apply the Rectified Linear Unit (ReLU) activation function to the resulting neurons. This function allows the model to learn non-linear relationships between the input features and the output. To connect the hidden layer to the output layer (10 neurons), we repeat the same steps (multiply by the weight units, add the bias unit, and apply the ReLU activation function). To generate the predicted class label, the outputs of the last layer pass through a maximum threshold function. All this gives a network configured with 6634 parameters (weights and biases) encoded in 8-bit format in the range of −128 to 127. The whole architecture is shown in Figure 4.
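The data path described above can be sketched in a few lines of pure Python. This is only an illustrative software model, not the authors' code: the function names are hypothetical, the weights in the usage example are placeholders rather than trained values, and integer division is assumed for the average.

```python
# Illustrative model of the inference path: pooling, flatten, two dense+ReLU
# layers, then a maximum function. All names here are hypothetical.

def avg_pool_2x2(image):
    """Average each non-overlapping 2x2 block (integer division is an assumption)."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) // 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def flatten(image):
    """Turn a 2D image into a 1D vector, row by row."""
    return [pixel for row in image for pixel in row]


def dense_relu(inputs, weights, biases):
    """weights[j][i] connects input i to neuron j; ReLU is applied to each sum."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        acc = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(max(acc, 0))
    return outputs


def predict(image, w_hidden, b_hidden, w_out, b_out):
    x = flatten(avg_pool_2x2(image))             # 28x28 -> 14x14 -> 196 values
    hidden = dense_relu(x, w_hidden, b_hidden)   # 196 -> 32
    scores = dense_relu(hidden, w_out, b_out)    # 32 -> 10
    return scores.index(max(scores))             # index of the largest output


def parameter_count(n_in=196, n_hidden=32, n_out=10):
    """Weights plus biases of both layers."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out
```

With the layer sizes given in the text, `parameter_count()` reproduces the 6634 parameters quoted above.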

To reach the best recognition accuracy, the proposed model extracts a set of micro-features from the input data corresponding to the properties of the expected output data. Therefore, the MLP learning procedure can be summarized as follows:

1.: Starting with the input layer, we propagate the data forward to the output layer. This step is the forward propagation. We calculate the activation units $a_{i}^{(h)}$ of the hidden layer, for $i$ from 1 to $n = 32$, according to Equation System (1) [7], where $X$ is the input data, $W$ is the weight unit, and $B$ is the bias unit. Note that we use the default Keras dense layer based on Glorot Uniform initialization (Xavier initialization) to initialize the weights.

$\begin{cases} a_{1}^{(h)} = X_{0} \times W_{0,1}^{(h)} + X_{1} \times W_{1,1}^{(h)} + \dots + X_{m} \times W_{m,1}^{(h)} + B_{1}^{(h)} \\ a_{2}^{(h)} = X_{0} \times W_{0,2}^{(h)} + X_{1} \times W_{1,2}^{(h)} + \dots + X_{m} \times W_{m,2}^{(h)} + B_{2}^{(h)} \\ \dots \\ a_{n}^{(h)} = X_{0} \times W_{0,n}^{(h)} + X_{1} \times W_{1,n}^{(h)} + \dots + X_{m} \times W_{m,n}^{(h)} + B_{n}^{(h)}, \end{cases}$

(1)

Then, we pass the result through the ReLU activation function $\phi(a^{(h)})$ defined by Equation (2) [7]. This provides the nonlinearity needed to solve complex problems like image processing.

$Z^{(h)} = \phi(a^{(h)}) = \begin{cases} a^{(h)} & \text{if } a^{(h)} > 0 \\ 0 & \text{otherwise}, \end{cases}$

(2)
2.: Based on the output, we calculate the loss using the Mean Squared Error (MSE) method (the difference between the predicted and known outcome). The error needs to be minimized, as defined by Equation (3) [39], where $Y(x)$ is the expected value and $\dot{Y}(x)$ is the predicted value.

$E(Y, \dot{Y}) = \frac{1}{2n} \sum \| Y(x) - \dot{Y}(x) \|^{2},$

(3)
3.: We backpropagate the error. We find its derivative with respect to each weight in the network, and update the model.
4.: We optimize using the Adam optimizer method.

Repeating these steps over 200 epochs, known as the training process, causes the model to learn the ideal weight for each connection.
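As a small illustration of the loss in Equation (3), the sketch below evaluates the MSE and its derivative with respect to the predictions, which is the quantity that backpropagation starts from. It assumes that $n$ is the number of samples and that targets are one-hot vectors; the function names are hypothetical, and the Adam update itself is not reproduced here.

```python
def mse_loss(expected, predicted):
    """Equation (3): E = 1/(2n) * sum over samples of ||Y(x) - Y_hat(x)||^2."""
    n = len(expected)  # assumption: n is the number of samples
    total = 0.0
    for y, y_hat in zip(expected, predicted):
        total += sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return total / (2 * n)


def mse_gradient(expected, predicted):
    """Derivative of E with respect to each predicted value: (Y_hat - Y) / n."""
    n = len(expected)
    return [
        [(b - a) / n for a, b in zip(y, y_hat)]
        for y, y_hat in zip(expected, predicted)
    ]


# Example: one sample, two classes, target class 0 predicted with score 0.5.
loss = mse_loss([[1.0, 0.0]], [[0.5, 0.0]])
grad = mse_gradient([[1.0, 0.0]], [[0.5, 0.0]])
```

The factor $\frac{1}{2}$ in Equation (3) cancels the 2 produced by differentiating the square, which is why the gradient reduces to the plain difference divided by $n$.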

Note that to design our MLP network, we chose a minimum number of layers to facilitate its implementation in an FPGA device using a minimum of hardware resources. We adopted the ReLU activation function because of its simple implementation, and we defined, after testing many combinations, the minimum number of neurons in each layer that ensures an acceptable accuracy.

4. FPGA Design of Proposed Multi-Layer Perceptron Model

Before implementing our MLP network on a hardware platform, we trained it on the Jupyter Notebook software platform (version 7.0.8) for 200 epochs using a batch size of 16 until reaching a handwritten digit recognition accuracy of 97%. Then, we extracted the learned network parameters (the weights and biases) and saved them in a ROM unit in the FPGA device. To achieve good performance by optimizing the low-level synthesis implementation, we used the VHDL description language to design each module of the architecture shown in Figure 5. To ensure the same accuracy on hardware as in software, we did not apply parameter simplification techniques (to the weights and biases) such as pruning or other similar methods. Therefore, in this section, we present a sample to show that the hardware architecture generates the expected result, which is the same result as the software model. The details of the implementation of each of the internal modules are given as follows.

Image load module: This module loads the input image of size $28 \times 28$ from the internal memory into a 2-dimensional (row and column) buffer of the same size. We process the data in groups of four pixels using a 32-bit bus (8 bits per pixel).
Average Pooling module: From the buffer, we launch seven clusters in parallel, and each one calculates the average of four neighboring pixels. Positions are given as (column, row). The first cluster calculates the average of the pixels at positions (0,0), (0,1), (1,0), (1,1); the second, at positions (2,0), (2,1), (3,0), (3,1); the third, at positions (4,0), (4,1), (5,0), (5,1); the fourth, at positions (6,0), (6,1), (7,0), (7,1); the fifth, at positions (8,0), (8,1), (9,0), (9,1); the sixth, at positions (10,0), (10,1), (11,0), (11,1); and the last one, at positions (12,0), (12,1), (13,0), (13,1). For the next round, we proceed with columns 14 to 27. Then, we pass to the next two rows, 2 and 3, and so on until we reach the last group of pixels. The architecture is shown in Figure 6. The processing of each pair of rows takes $(2 \times 1) = 2$ clock cycles. Therefore, the whole process takes $(14 \times 2) = 28$ clock cycles.
Reshape module: This intermediate module only reorganizes the data into a 1-dimensional vector to facilitate the calculations and the connection to the hidden layer. Therefore, for the pooled image of size $14 \times 14$ pixels, the module outputs a vector of 196 pixels.
Neurons calculation module: As mentioned above, the trained weights are saved in the internal memory of our FPGA. To connect the input 1-D vector to the hidden layer that generates 32 neurons, we need $196 \times 32$ weight values and 32 bias values, and to connect the hidden layer to the output layer of 10 neurons, we need $32 \times 10$ weight values and 10 bias values. As a result, we need 6634 parameters ($196 \times 32 + 32 + 32 \times 10 + 10$). Note that all the learned parameters are integers encoded in an 8-bit signed format (values in the range of −128 to 127) in the VHDL description (two's complement binary representation, C2). To implement an accelerated design, during the hidden layer calculation, we run all the neurons in parallel, which results in 32 multiplications in parallel: $\mathit{Mul}_{0} \mathrel{+}= W_{0,i} \times P_{i}$, $\mathit{Mul}_{1} \mathrel{+}= W_{1,i} \times P_{i}$, $\dots$, $\mathit{Mul}_{31} \mathrel{+}= W_{31,i} \times P_{i}$ ($i = 0$ to $195$). $W$ and $P$ are 8-bit VHDL signed values, and $\mathit{Mul}$ is a 32-bit VHDL signed value. This represents parallel stage 2, as illustrated in Figure 7. Therefore, the hidden layer is generated in 197 clock cycles (196 cycles for the cumulative weight multiplications and 1 cycle for the bias addition). Similarly, during the output layer calculation, we run 10 neurons in parallel, which results in 10 multiplications in parallel: $\mathit{Mulo}_{0} \mathrel{+}= W_{0,i} \times \mathit{RL}_{i}$, $\mathit{Mulo}_{1} \mathrel{+}= W_{1,i} \times \mathit{RL}_{i}$, $\dots$, $\mathit{Mulo}_{9} \mathrel{+}= W_{9,i} \times \mathit{RL}_{i}$ ($i = 0$ to $31$). $W$ is an 8-bit, $\mathit{RL}$ is a 16-bit, and $\mathit{Mulo}$ is a 32-bit VHDL signed value. This represents parallel stage 3, as illustrated in Figure 8. Therefore, the output layer is generated in 11 clock cycles (10 cycles for the cumulative weight multiplications and 1 cycle for the bias addition). Note that more parallel calculations are possible, but this increases the required hardware logic resources.
ReLU calculation module: To implement the ReLU function defined by Equation (2), we pipeline the output of the previous module into an intermediate 32-bit register R. Then, we compare the content of this register to 0. If it is negative, we generate 0 as the output value. Otherwise, we pass the content of register R to the output port. Note that we connect a ReLU module to every neuron and run them in parallel at each of the two layers.
Maximum classification module: This module receives the 10 neurons of the output layer and compares them. Then, it outputs the index of the maximum value, which corresponds to the predicted digit for the input image. Although the possible outputs are values in the range of 0 to 9, we encoded the result in an 8-bit format to allow a possible extension of the architecture to alphabetic characters. In our FPGA implementation, we displayed this value on the LEDs of the Pynq-Z2 board.
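The pooling schedule described in the module list (seven clusters per cycle, two cycles per pair of rows) can be checked with a short Python model. It is only a sketch of the scheduling, not of the VHDL: the function name and the choice to report each block by its top-left (column, row) corner are assumptions made for illustration.

```python
def pooling_schedule(width=28, height=28, clusters=7):
    """Return a list of (cycle, corners) for the parallel average pooling stage.

    corners holds the top-left (column, row) position of each 2x2 block
    handled in that cycle, one block per cluster.
    """
    schedule = []
    cycle = 0
    blocks_per_row_pair = width // 2                 # 14 pooled pixels per pair of rows
    for row in range(0, height, 2):                  # rows (0,1), then (2,3), ...
        for start in range(0, blocks_per_row_pair, clusters):
            corners = [(2 * b, row) for b in range(start, start + clusters)]
            schedule.append((cycle, corners))
            cycle += 1
    return schedule


schedule = pooling_schedule()
total_cycles = len(schedule)                         # 14 row pairs x 2 cycles each
total_blocks = sum(len(corners) for _, corners in schedule)
```

In this model the first cycle covers columns 0 to 13 of rows 0 and 1, the second covers columns 14 to 27, and the count of cycles and pooled pixels matches the 28 cycles and 196 values stated in the text.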

For more precision, we indicate in Table 2 the width of the signals used in the input and output of each VHDL module.
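To make the signal widths concrete, the following Python sketch models the 8-bit two's complement encoding and a multiply-accumulate neuron, and bounds the accumulator size. It is a hedged illustration under stated assumptions: the helper names are hypothetical, and the worst case assumes every operand takes its most negative value.

```python
def to_twos_complement(value, bits=8):
    """Encode a signed integer as an unsigned bit pattern of the given width."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError(f"{value} does not fit in {bits} signed bits")
    return value & ((1 << bits) - 1)


def from_twos_complement(pattern, bits=8):
    """Decode an unsigned bit pattern back into a signed integer."""
    sign_bit = 1 << (bits - 1)
    return (pattern ^ sign_bit) - sign_bit


def mac_neuron(weights, inputs, bias, acc_bits=32):
    """One neuron: one multiply-accumulate per cycle, then one cycle for the bias."""
    acc, cycles = 0, 0
    for w, x in zip(weights, inputs):
        acc += w * x
        cycles += 1
    acc += bias
    cycles += 1
    to_twos_complement(acc, acc_bits)   # raises if the accumulator would overflow
    return acc, cycles


def worst_case_bits(n_terms, w_bits, x_bits):
    """Signed bits needed to hold n_terms products of the most negative operands."""
    bound = n_terms * (1 << (w_bits - 1)) * (1 << (x_bits - 1))
    return bound.bit_length() + 1


hidden_bits = worst_case_bits(196, 8, 8)    # 8-bit weights times 8-bit pixels
output_bits = worst_case_bits(32, 8, 16)    # 8-bit weights times 16-bit ReLU outputs
```

Under these assumptions both accumulations stay well inside a 32-bit signed register, and a 196-input neuron takes 197 cycles in this model, in line with the hidden layer timing given in the text.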

Figure 6. Architecture of parallel stage 1, including the seven average pooling calculations.

Figure 7. Architecture of parallel stage 2, including the 32 cumulative multiplications, pipeline registers, and bias addition.

Figure 8. Architecture of parallel stage 3, including the 10 cumulative multiplications, pipeline registers, and bias addition.

To test the implemented architecture, we randomly extracted an example image from the test part of the MNIST dataset and saved it in a BRAM unit in the FPGA. At the same time, we saved the parameters learned on the Jupyter Notebook software platform (version 7.0.8) to the FPGA ROM, as discussed before. Then, we launched the implemented architecture. Note that for the Hardware Description Language (HDL) implementation of the proposed architecture, we used the Vivado 2022.1 tools from AMD Xilinx. To check the final output for the tested example, we ran a behavioral simulation using the ISIM simulator from Vivado. The simulation succeeded, and the results are shown in Figure 9. As we can see, the hardware architecture generates the expected output (the signal $\mathit{digit\_result} = 4$ for an input image containing the handwritten character 4), showing that it behaves like the software version on this example. The advantage of the FPGA design is the acceleration of the network calculations through highly parallel execution, together with the low power consumption offered by FPGA technology. More precisely, the FPGA-implemented design processes an input image and recognizes the number it contains in 2.192 µs (indicated by the yellow marker in Figure 9), while consuming only 0.36 W. In addition, the implemented design requires 20,758 LUTs, 4426 FFs, 42 DSPs, and 3.5 Block RAMs. Therefore, we conclude that the timing and hardware performance reached by the proposed architecture are suitable for real-time recognition of handwritten digits in everyday applications.
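For orientation, the reported latency can be converted into clock cycles with the 8 ns clock period stated in Section 5 and compared with the per-stage cycle counts quoted in Section 4. The short Python sketch below does only this arithmetic; it assumes the quoted stages run one after another for a single image, and the variable names are illustrative.

```python
CLOCK_PERIOD_NS = 8       # T = 8 ns, as stated in Section 5
LATENCY_NS = 2192         # 2.192 us reported for one image

# Per-stage cycle counts quoted in Section 4.
STAGE_CYCLES = {
    "average pooling": 28,
    "hidden layer": 197,
    "output layer": 11,
}

total_cycles = LATENCY_NS // CLOCK_PERIOD_NS
itemized_cycles = sum(STAGE_CYCLES.values())
# Cycles that the text does not break down stage by stage.
remaining_cycles = total_cycles - itemized_cycles
```

The remainder is simply the part of the latency that is not itemized in the module descriptions; the text does not say how it is distributed.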

5. Comparison and Discussion

In this section, we compare the proposed architecture’s performance with that of previous FPGA implementations in terms of logic resources, timing metrics, and power consumption and discuss the results.

When comparing our FPGA implementation to the literature, we can see that it rivals the existing implementations in terms of both timing metrics and hardware logic resources, as shown by the comparison given in Table 3. More precisely, considering the compromise between the LUTs, FFs, and DSPs used, the proposed architecture requires less hardware area in the FPGA. It also uses only 3.5 Block RAMs, whereas the existing implementations require at least 32 BRAMs [40]. The reduced area is due to the small number of neurons in the model (196 input pixels, 32 hidden neurons, and 10 output neurons), which reduces the required LUTs to only those needed for calculating the pooling function, the ReLU activation function, and the maximum threshold function, and for adding the biases to the weighted sums. The DSPs are used to perform the accumulated multiplications that generate the output of the hidden layer (32 DSPs for 32 multiplications) and the output layer (10 DSPs for 10 multiplications). Similarly, FFs are used to save intermediate values of the network parameters and to pipeline the layer connections, and BRAMs are used to store the learned parameters (weights and biases). For more details on resource utilization and optimization strategies for FPGA implementations, see [41].

For the timing characteristics, we succeeded in implementing a system that recognizes a handwritten digit at high speed. Our design is 15.62 times faster than the best one in the literature [13]. The software execution time on a sequential CPU device is estimated as 54.72 µs (6840 cycles of 8 ns) according to Equation (4): $14 \times 14$ cycles for pooling, $(196 + 1) \times 32$ cycles for the input layer neurons calculation, $(32 + 1) \times 10$ cycles for the output layer neurons calculation, and 10 cycles to generate the expected value. Compared with this estimate, we find an acceleration factor of 24.96, which is very promising for hardware acceleration systems.

$\mathit{Soft\_exec\_time} = [(14 \times 14) + (196 + 1) \times 32 + (32 + 1) \times 10 + 10] \times T,$

(4)

Note that $T = 8$ ns is the clock period available on the Pynq-Z2 board used. The achieved speedup is due, on the one hand, to the three parallel processing stages described in Section 4: the first one performs seven parallel pooling functions, the second one performs 32 parallel multiplications, and the last one performs 10 parallel multiplications. On the other hand, it is due to the pipeline registers placed between the model layers. Together, these considerably reduce the execution time compared to a sequential CPU-based implementation. For more on timing optimization techniques in FPGA implementations, see, for example, [42].
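The estimate in Equation (4) and the resulting acceleration factor can be reproduced with a few lines of Python. This is just the arithmetic of the equation with the figures given in the text; the function and variable names are illustrative.

```python
def sequential_cycles(n_in=196, n_hidden=32, n_out=10, pooled_side=14):
    """Cycle count of Equation (4) for a purely sequential execution."""
    pooling = pooled_side * pooled_side        # one cycle per pooled pixel
    hidden = (n_in + 1) * n_hidden             # 196 products + 1 bias, per hidden neuron
    output = (n_hidden + 1) * n_out            # 32 products + 1 bias, per output neuron
    classify = n_out                           # cycles to pick the final value
    return pooling + hidden + output + classify


T_NS = 8                   # clock period in nanoseconds
FPGA_LATENCY_US = 2.192    # measured latency of the FPGA design

cycles = sequential_cycles()
soft_time_us = cycles * T_NS / 1000
speedup = soft_time_us / FPGA_LATENCY_US

print(f"{cycles} cycles, {soft_time_us:.2f} us, speedup {speedup:.2f}x")
```

With $T = 8$ ns, this gives 6840 cycles, 54.72 µs, and a ratio of about 24.96 with respect to the 2.192 µs FPGA latency.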

In addition, for power consumption, our design requires only 0.36 W, as reported by Vivado after the place and route steps. The report indicates that 70% of the power is dynamic and only 30% is static. The design therefore consumes 2.7 times less power than the most efficient implementation in the literature [37], and 53.64 times less than the most power-hungry one [13]. The energy gain in our design is due to the data-flow optimization technique. We reduced the number of memory accesses, we kept the computation localized within on-chip resources (BRAM), we defined the learned parameters (weights and biases) as fixed and accessible in read-only mode, and we limited the energy spent fetching data from external storage to loading the image. Consequently, this technique of localizing computations and minimizing data movement leads to lower power consumption. Furthermore, it can be better than a GPU-based implementation, as found in the literature [43].
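The power figures above can be turned into an approximate energy per recognized image and a dynamic/static split. The Python sketch below assumes that the full 0.36 W is drawn for the whole 2.192 µs latency of one inference, which is a simplification; the variable names are illustrative.

```python
POWER_W = 0.36            # total power from the Vivado report
LATENCY_S = 2.192e-6      # time to recognize one image
DYNAMIC_SHARE = 0.70      # share of dynamic power stated in the text

# Energy for one inference, in microjoules (power x time).
energy_per_image_uj = POWER_W * LATENCY_S * 1e6

dynamic_w = POWER_W * DYNAMIC_SHARE
static_w = POWER_W - dynamic_w

print(f"{energy_per_image_uj:.3f} uJ per image, "
      f"{dynamic_w:.3f} W dynamic, {static_w:.3f} W static")
```

Under this assumption, one inference costs a little under 0.8 µJ, of which roughly 0.25 W of the power budget is dynamic and about 0.11 W is static.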

Note that other implementations have been tested to accelerate processing further. For example, by executing four parallel multiplications for each neuron, we reached an execution time of 0.824 µs, which is 41.55 times faster than the best time in the literature [13]. However, this accelerated design requires more logic resources, and the amount depends on the strategy used: if we prefer using LUTs, it consumes 44,622 LUTs, 9709 FFs, and 6 DSPs; if we prefer using DSPs, it occupies 20,381 LUTs, 4451 FFs, and 246 DSPs. The simulation result is given in Figure 10, where the time required to produce the expected digit (signal $\mathit{digit\_result} = 4$ in this example) is indicated by the yellow marker. This second parallel implementation therefore offers more flexibility in choosing an architecture suited to the application: the user must weigh the required execution time against the available logic resources.
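The trade-off described above can be sketched as a small selection routine over the three reported variants. The resource and timing figures are those quoted in the text; the budgets in the usage lines, the variant names, and the rule of preferring the variant with the fewest LUTs are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    luts: int
    ffs: int
    dsps: int
    time_us: float


# Resource and timing figures quoted in the text for the three variants.
VARIANTS = [
    Variant("baseline", 20758, 4426, 42, 2.192),
    Variant("4x parallel, LUT-based", 44622, 9709, 6, 0.824),
    Variant("4x parallel, DSP-based", 20381, 4451, 246, 0.824),
]


def pick_variant(max_luts, max_dsps, deadline_us):
    """Return the feasible variant with the fewest LUTs, or None if none fits.

    "Fewest LUTs" is an arbitrary preference chosen for this sketch.
    """
    feasible = [
        v for v in VARIANTS
        if v.luts <= max_luts and v.dsps <= max_dsps and v.time_us <= deadline_us
    ]
    return min(feasible, key=lambda v: v.luts, default=None)


# Hypothetical budgets, chosen only to exercise each case.
relaxed = pick_variant(max_luts=30000, max_dsps=100, deadline_us=3.0)
lut_rich = pick_variant(max_luts=50000, max_dsps=100, deadline_us=1.0)
dsp_rich = pick_variant(max_luts=30000, max_dsps=300, deadline_us=1.0)
too_tight = pick_variant(max_luts=30000, max_dsps=100, deadline_us=1.0)

print(relaxed.name, "|", lut_rich.name, "|", dsp_rich.name, "|", too_tight)
```

A tight deadline forces one of the accelerated variants, and the choice between them then depends on whether the target device has spare LUTs or spare DSP blocks.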

Before concluding this discussion, we note that the proposed MLP-based architecture is best suited to low-resolution applications such as handwritten digit recognition, owing to its simplicity and low area requirement. More sophisticated models such as CNNs and Transformers require more computation per layer because of their specialized architectures (convolutions, attention mechanisms). These models often achieve better performance, but they are adapted to more complex tasks such as image classification, object detection, and segmentation.

Finally, we can conclude that the results obtained in this work are promising for real-time handwriting recognition applications based on an optimized embedded system. The proposed architecture can be easily scaled to different FPGA devices because all the internal functions were designed and instantiated in behavioral and structural VHDL. For example, a small FPGA could easily be integrated into a camera to achieve this objective.

Table 3. FPGA implementation and comparison results.

Work	Device	LUT	FF	DSP	BRAM	Exec. Time	Freq. (MHz)	Power
Al-Khaleel et al. [36]	Spartan-6	56,864	1596	-	-	-	160.7	-
Ahn [40]	KC705	42,616	-	326	31.5	-	-	-
Yilmaz et al. [44]	Virtex 7	79,322	9243	134	-	-	200	-
Yu et al. [13]	Alveo U50	72,813	36,663	0	234	34.24 µs	300.3	18.59 W
Wang et al. [38]	Cyclone IV	24,245	16,356	93	-	-	-	-
Moradi et al. [33]	Spartan 3AN	-	-	-	-	-	112	-
Khan et al. [31]	Max+ II	-	-	-	-	72.96 µs	4.36	-
Giardino et al. [37]	XC7A100T	15,796	106,400	-	73	41 µs	300	0.975 W
This work	Zynq 7	20,758	4426	42	3.5	2.192 µs	125	0.364 W
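Because the designs in Table 3 run at very different clock frequencies, it can help to look at latency in clock cycles as well as in microseconds. The sketch below derives both the execution-time ratio and an approximate cycle count for the rows that report execution time and frequency. The cycle counts are not given in the paper and assume that the listed frequency is the clock used during the reported measurement; the names are illustrative.

```python
# (execution time in us, clock frequency in MHz) for Table 3 rows reporting both.
ROWS = {
    "Yu et al. [13]": (34.24, 300.3),
    "Khan et al. [31]": (72.96, 4.36),
    "Giardino et al. [37]": (41.0, 300.0),
    "This work": (2.192, 125.0),
}


def speedup_over(name, reference="This work"):
    """Execution time of `name` divided by that of the reference design."""
    return ROWS[name][0] / ROWS[reference][0]


def latency_cycles(name):
    """Approximate latency in clock cycles (us * MHz = cycles).

    Assumption: the listed frequency is the clock used for the reported time.
    """
    time_us, freq_mhz = ROWS[name]
    return time_us * freq_mhz


for name in ROWS:
    print(f"{name}: {speedup_over(name):.2f}x slower than this work, "
          f"about {latency_cycles(name):.0f} cycles")
```

For this work, 2.192 µs at 125 MHz corresponds to 274 clock cycles.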

6. Conclusions and Future Work

In this paper, we designed a handwritten digit recognition system based on a three-layer MLP NN model. We trained the model in Jupyter Notebook to extract the learned parameters (weights and biases). We then introduced the internal functions of the algorithm and explained their design on the FPGA device. The system was implemented on the Xilinx Zynq 7 FPGA of the Pynq-Z2 board, where it occupies 20,758 LUTs, 4426 FFs, 3.5 BRAMs, and 42 DSPs. The hardware implementation recognizes a digit in only 2.192 µs, which can be reduced to 0.824 µs by using more hardware resources. In addition, it is an economical design with a low power consumption of only 0.36 W. This performance is obtained by pipelining the model layers and exploiting the highly parallel processing capabilities of FPGA devices, as demonstrated in recent years in embedded systems. The software and hardware analyses are confirmed by simulation tests (functional, at the behavioral level) and practical tests (on board). An acceleration factor of 24.96 compared to the sequential software estimate of Equation (4), and of 41.55 for the accelerated variant compared to the fastest implementation in the literature [13], demonstrates the efficiency of the proposed design. The experimental results surpass the compared works in the literature, and they facilitate the use of the proposed design in various real-time handwriting recognition applications based on embedded systems.

In future work, we will first explore designing a PYNQ overlay for the hardware architecture using block diagrams and IP cores from Vivado, and then control the architecture using newly designed Python drivers. PYNQ’s Python interface enables rapid prototyping of our applications on an FPGA, so that tests can be performed faster with the same performance. Second, we will extend the architecture to recognize different character types and to rewrite poorly written (difficult to read) texts more legibly by retraining the model on the EMNIST dataset and on noisier datasets. Third, we considered only off-chip learning in this work; to extend the usefulness of the design, we will explore the possibility of extending it to online learning applications.

Author Contributions

Conceptualization, M.M.; Methodology, M.M.; Software, M.M.; Validation, M.M.; Formal analysis, M.M.; Investigation, M.M.; Resources, M.M.; Writing—original draft, M.M.; Writing—review and editing, M.M.; Visualization, M.M. and E.-B.B.; Supervision, M.M. and E.-B.B.; Project administration, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The MNIST dataset used in this study is publicly available at http://yann.lecun.com/exdb/mnist/ (accessed on 18 November 2025). In our case, we loaded it directly in a Jupyter Notebook with a single line using standard libraries: from tensorflow.keras.datasets import mnist.

Acknowledgments

The authors would like to thank AMD Xilinx Inc. for providing the Pynq-Z2 FPGA hardware platform used via the Xilinx University Program, which made this project possible.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ANN	Artificial Neural Networks
CPU	Central Processing Unit
FPGA	Field Programmable Gate Arrays
DSP	Digital Signal Processing
MLP	Multi-Layer Perceptron
VHDL	VHSIC Hardware Description Language
AI	Artificial Intelligence
GPU	Graphics Processing Units
ROM	Read-Only Memory
CNN	Convolutional Neural Network
OCR	Optical Character Recognition
ICFHR	International Conference on Frontiers of Handwriting Recognition
ICDAR	International Conference on Document Analysis and Recognition
SVM	Support Vector Machines
AOCR	Arabic Optical Character Recognition
DBN	Deep Belief Network
ReLU	Rectified Linear Unit
MSE	Mean Squared Error
HDL	Hardware Description Language

References

1. Morris, G. Central processing unit (CPU). In Encyclopedia of Computer Science 2003; John Wiley & Sons: Chichester, UK, 2003; pp. 199–200.
2. Kehoe, P.; Smeaton, A.F. Using Graphics Processor Units (GPUs) for Automatic Video Structuring. In Proceedings of the Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS ’07), Santorini, Greece, 6–8 June 2007; p. 18.
3. Wang, Y.; Gao, L.; Yang, H. FPGA Programmable Logic Block Architecture with High-Density MAC for Deep Learning Inference. Electronics 2026, 15, 801.
4. Madani, M.; Assad, S.E.; Dridi, F.; Lozi, R. Enhanced design and hardware implementation of a chaos-based block cipher for image protection. J. Differ. Equ. Appl. 2022, 29, 1408–1428.
5. Sadeghi, S.; Cpi, P. A Comprehensive Review of Digital Signal Processing (DSP) Algorithms and Their Applications in Telecommunication and Wireless Communication Systems. Int. J. Eng. Technol. Sci. 2025, 2025, 60.
6. Norgbe, C.; Madani, M.; Bourennane, E.B. Privacy-Preserving for Medical Images Using Cryptosystem and Convolutional Autoencoder. In Proceedings of the 2025 9th International Conference on Computer, Software and Modeling (ICCSM), Rome, Italy, 3–5 July 2025; pp. 40–45.
7. Ramchoun, H.; Idrissi, M.; Ghanou, Y.; Ettaouil, M. Multilayer Perceptron: Architecture Optimization and Training. Int. J. Interact. Multimed. Artif. Intell. 2016, 4, 26–30.
8. Hojjat, K. MNIST Dataset. Available online: https://www.kaggle.com/datasets/hojjatk/mnist-dataset (accessed on 18 May 2026).
9. TUL Embedded. PYNQ-Z2 Development Board—Product Specification. Available online: https://www.tulembedded.com/fpga/ProductsPYNQ-Z2.html (accessed on 18 May 2026).
10. Madani, M.; Benkhaddra, I.; Tanougast, C.; Chitroub, S.; Sieler, L. Digital Implementation of an Improved LTE Stream Cipher SNOW 3G based on Hyperchaotic PRNG. In Security and Communication Networks; John Wiley & Sons: Hoboken, NJ, USA, 2017; Volume 2017, 15p.
11. Cherifi, R.; Madani, M. Secure and Efficient Tele-Radiography Based on the Fusion of a Convolutional Autoencoder and Chaotic Latent Encryption. J. Image Graph. 2026, 14, 49–57.
12. Preetha, S.; Afrid, I.M.; Karthik Hebbar, P.; Nishchay, S.K. Machine Learning for Handwriting Recognition. Int. J. Comput. 2020, 38, 93–101.
13. Yu, K.; Kim, M.; Choi, J.R. Memory-Tree Based Design of Optical Character Recognition in FPGA. Electronics 2023, 12, 754.
14. AlKendi, W.; Gechter, F.; Heyberger, L.; Guyeux, C. Advancements and Challenges in Handwritten Text Recognition: A Comprehensive Survey. J. Imaging 2024, 10, 18.
15. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning; Springer: Berlin/Heidelberg, Germany, 2011.
16. Surya, N.R.; Afseena, S. Handwritten Character Recognition—A Review. Int. J. Sci. Res. Publ. 2015, 5, 3.
17. Aljarrah, I.A.; Al-Khaleel, O.D.; Mhaidat, K.M.; Alrefai, M.; Alzu’bi, A.; Rabab’ah, M. Automated System for Arabic Optical Character Recognition with Lookup Dictionary. J. Emerg. Technol. Web Intell. 2012, 4, 362–370.
18. Inad, A.; Osama, A.K.; Khaldoon, M.; Mu’ath, A.; Abdullah, A.; Mohammad, R. Automated system for Arabic optical character recognition. In Proceedings of the Association for Computing Machinery, New York, NY, USA, 11–14 November 2012.
19. Haidar, A.; John, G.; Hisham, A. A Real-time DSP-Based Optical Character Recognition System for Isolated Arabic characters using the TI TMS320C6416T. In Proceedings of the 2008 IAJC-IJME International Conference, Nashville, TN, USA, 17–19 November 2008.
20. Bhagat, M.S.S.; Joshi, M.A.R.; Gajbhiye, M.V.S.; Nandanwar, M.S.R.; Ingle, P.M. Handwritten Character Detection Using Optical Character Recognition Method. Int. J. Res. Appl. Sci. Eng. Technol. 2018, 6, 4724–4726.
21. Rajabi, M.; Nematbakhsh, N.; Monadjemi, A. A New Decision Tree for Recognition of Persian Handwritten Characters. Int. J. Comput. Appl. 2012, 44, 52–58.
22. Obaid, A.; El-Bakry, H.; Eldosuky, M.; Shehab, A. Handwritten Text Recognition System based on Neural Network. Int. J. Adv. Res. Comput. Sci. Technol. 2016, 4, 72–77.
23. Meenu, M.; Jyothi, R.L. Handwritten Character Recognition: A Comprehensive Review on Geometrical Analysis. IOSR J. Comput. Eng. (IOSR-JCE) 2015, 17, 83–88.
24. Rao, P.S.; Aditya, J.N.H.S. Handwriting Recognition—“Offline” Approach. In Proceedings of the Research School of Computer Science, Stockholm, Sweden, 21–25 June 2014.
25. Rosyda, S.S.; Purboyo, T.W. A Review of Various Handwriting Recognition Methods. Int. J. Appl. Eng. Res. 2018, 13, 1155–1164.
26. Zhu, W. Classification of MNIST Handwritten Digit Database using Neural Network. In Proceedings of the Research School of Computer Science 2018, Canberra, Australia, 20 July 2018; Available online: https://api.semanticscholar.org/CorpusID:202741200 (accessed on 18 May 2026).
27. Shamim, S.M.; Miah, M.B.; Sarker, A.; Rana, M.; Jobair, A. Handwritten Digit Recognition Using Machine Learning Algorithms. Glob. J. Sci. Technol. 2018, 18, 29–39.
28. Darmatasia; Fanany, M.I. Handwriting recognition on form document using convolutional neural network and support vector machines (CNN-SVM). In Proceedings of the 2017 5th International Conference on Information and Communication Technology (ICoIC7), Melaka, Malaysia, 17–19 May 2017; pp. 1–6.
29. Fahmy, M.M.M.; Ali, S.A. Automatic recognition of handwritten arabic characters using their geometrical features. Stud. Inform. Control 2001, 10, 81–98.
30. Ali, A.H.; Mohammed, M.A.; Ahmed, M.A. Character Recognition By Implementing FPGA-Based Artificial Neural Network. Mesopotamian J. Comput. Sci. 2021, 2021, 13–17.
31. Khan, F.; Uppal, M.; Song, W.C.; Kang, M.J.; Mirza, A. FPGA Implementation of a Neural Network for Character Recognition. Adv. Neural Netw. 2006, 2006, 1357–1365.
32. Rahardjo, P.M.; dan Nanang Sulistyanto, M.R. The Implementation of Feedforward Backpropagation Algorithm for Digit Handwritten Recognition in a Xilinx Spartan-3. J. EECCIS 2010, IV, 2.
33. Moradi, M.; Pourmina, M.A.; Razzazi, F. FPGA-Based Farsi Handwritten Digit Recognition System. Int. J. Simul. Syst. Sci. Technol. 2010, 11, 17–22.
34. Toosizadeh, N.; Eshghi, M. Design and implementation of a new persian digits ocr algorithm on fpga chips. In Proceedings of the 13th Conference, European Signal Processing (EUSIPCO2005), Antalya, Turkey, 4–8 September 2005.
35. Al-Marakeby, A.; Kimura, F.; Zaki, M.; Rashid, A. Design of an Embedded Arabic Optical Character Recognition. J. Signal Process. Syst. 2013, 70, 249–258.
36. Al-Khaleel, O.; Aljarrah, I.; Idries, A.; Mhaidat, K. Hardware Implementation of Web Based Arabic Optical Character Recognition Units. J. Emerg. Technol. Web Intell. 2014, 6, 210–219.
37. Giardino, D.; Matta, M.; Silvestri, F.; Spanò, S.; Trobiani, V. FPGA Implementation of Hand-written Number Recognition Based on CNN. Int. J. Adv. Sci. Eng. Inf. Technol. 2019, 9, 167–171.
38. Wang, L.; Yang, Z.; Xu, G.R.; lan Fu, M.; Wang, Y. Design of FPGA-based Handwriting Image Recognition System. Adv. Model. Anal. B 2017, 60, 426–437.
39. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
40. Ahn, B. Real-time video object recognition using convolutional neural network. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–7.
41. Kokkinis, A.; Siozios, K. Fast Resource Estimation of FPGA-Based MLP Accelerators for TinyML Applications. Electronics 2025, 14, 247.
42. Zhao, Y.; Li, M.; Zhang, Y.; Lin, Q.; Chen, Z. Research on FPGA timing optimization methods with large on-chip memory resource utilization in PCIe DMA. In Proceedings of the 2016 CIE International Conference on Radar (RADAR), Guangzhou, China, 10–13 October 2016; pp. 1–4.
43. Giasemis, F.I.; Lončar, V.; Granado, B.; Gligorov, V.V. Comparative Analysis of FPGA and GPU Performance for Machine Learning-Based Track Reconstruction at LHCb. In Proceedings of the 2025 23rd IEEE Interregional NEWCAS Conference (NEWCAS), Paris, France, 22–25 June 2025; Available online: http://arxiv.org/abs/2502.02304 (accessed on 26 May 2026).
44. Yilmaz, A.R.; Erkmen, B.; Yavuz, O. Accelerating handwritten signature recognition using intelligent algorithm based embedded system. Sigma J. Eng. Nat. Sci. Sigma MüHendislik Ve Fen Bilim. Derg. 2016, 34, 393–405.

Figure 1. Architecture of a single neural network layer, including the input pixels $X_{m}$, the weights $W_{m}$, the bias $B$, and the ReLU activation function [7].

Figure 2. External architecture of the multi-layer perceptron neural network, including an input layer, a hidden layer, and an output layer.

Figure 3. Average pooling step calculating the mean value from four pixels.

Figure 4. Detailed architecture of the multi-layer perceptron used, showing the connection of the input layer, the hidden layer, and the output layer with the model parameters ($a_{n}$, $a_{p}$, $z_{n}$, $z_{p}$).

Figure 5. Internal architecture of the implemented multi-layer perceptron neural network inside the FPGA device.

Figure 9. Vivado simulation result of the accelerated system 1.

Figure 10. Vivado simulation result of the accelerated system 2.

Table 1. Comparative summary of analyzed handwriting recognition methods.

Category	References	Strong Sides (Advantages)	Weak Sides (Limitations)
Literature surveys	[12,13,14]	Comprehensive data compilation	Absence of hardware metrics
		Global methodology overviews	No direct algorithmic validation
		Multi-lingual dataset indexing	Restricted to high-level analysis
Traditional OCR and structural pipelines	[17,18,20,23,24]	Low-complexity geometric logic	Manual feature engineering required
		Effective background noise removal	Weak adaptation to custom styles
		Deterministic and fast execution	Highly sensitive to font distortions
Embedded software	[19,22,29]	High algorithmic flexibility	Strict sequential code execution
		Straightforward software patching	Limited computational throughput
		Low development abstraction barrier	Hardware functional bottlenecks
Deep learning models	[15,16,25,26,27,28]	State-of-the-art classification accuracy	High computational footprint
		Automated feature extraction	Extensive memory consumption
		Robustness against input noise	Requires high-end CPU/GPU nodes
Early Hardware and Lightweight FPGA Matrix Models	[21,30,31,32,33]	Deterministic execution latency	Restricted to tiny binary grids
		High internal parallel processing	Oversimplified neural topologies
		Low operating energy consumption	Low generalization capabilities
Modern Customized FPGA Systems	[34,35,36,37,38]	Superior real-time processing throughput	Highly static hardware architecture
		Tailored internal resource balancing	Time-consuming in coding phases
		Optimized power efficiency at the edge	Complex co-design implementation

Table 2. Input–Output width used in each VHDL module.

VHDL Module	Input Width	Output Width
Image load	$28 \times 28 \times 8$-bit	$28 \times 28 \times 8$-bit
Average pooling	$28 \times 28 \times 8$-bit	$14 \times 14 \times 8$-bit
Reshape	$14 \times 14 \times 8$-bit	$1 \times 196 \times 8$-bit
Neurons calculation 1	$32 \times 196 \times 8$-bit	$32 \times 32$-bit
ReLU 1	$32 \times 32$-bit	$32 \times 16$-bit
Neurons calculation 2	$32 \times 32 \times 16$-bit	$10 \times 32$-bit
ReLU 2	$10 \times 32$-bit	$10 \times 16$-bit
Maximum classification	$10 \times 16$-bit	$8$-bit
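The chain of modules in Table 2 can be followed as a software reference model. The sketch below mirrors only the order of the stages and the data shapes (28 × 28 image, 14 × 14 pooled image, 196-element vector, 32 activations, 10 activations, one digit). It is not the authors' VHDL: the integer division used for the mean, the random placeholder parameters, and the absence of any 32-bit to 16-bit narrowing are all assumptions made for illustration.

```python
import random


def average_pool_2x2(image):
    """Mean of each 2x2 block (integer division by 4, an assumption): 28x28 -> 14x14."""
    n = len(image) // 2
    return [[(image[2 * r][2 * c] + image[2 * r][2 * c + 1]
              + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) // 4
             for c in range(n)] for r in range(n)]


def dense_relu(inputs, weights, biases):
    """One layer: weighted sum plus bias per neuron, followed by ReLU."""
    return [max(0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def classify(image, w1, b1, w2, b2):
    pooled = average_pool_2x2(image)            # average pooling: 14 x 14 values
    flat = [p for row in pooled for p in row]   # reshape: 1 x 196 values
    hidden = dense_relu(flat, w1, b1)           # neurons calculation 1 + ReLU 1: 32 values
    scores = dense_relu(hidden, w2, b2)         # neurons calculation 2 + ReLU 2: 10 values
    return max(range(len(scores)), key=scores.__getitem__)  # maximum classification


def random_matrix(rng, rows, cols, low, high):
    return [[rng.randint(low, high) for _ in range(cols)] for _ in range(rows)]


# Placeholder data: random values, NOT the trained weights or a real MNIST image.
rng = random.Random(0)
image = random_matrix(rng, 28, 28, 0, 255)
w1, b1 = random_matrix(rng, 32, 196, -128, 127), [0] * 32
w2, b2 = random_matrix(rng, 10, 32, -128, 127), [0] * 10

digit = classify(image, w1, b1, w2, b2)
print("predicted class index:", digit)
```

With trained weights and biases substituted for the placeholders, the same function could serve as a golden reference when checking simulation outputs, provided the fixed-point details are aligned with the hardware.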
