Efficient On-Chip Learning of Multi-Layer Perceptron Based on Neuron Multiplexing Method

Abstract: An efficient on-chip learning method based on neuron multiplexing is proposed in this paper to address the limitations of traditional on-chip learning methods, including low resource utilization and non-tunable parallelism. The proposed method utilizes a configurable neuron calculation unit (NCU) to calculate neural networks with different degrees of parallelism by multiplexing NCUs at different levels. Resource utilization can be increased by reducing the number of NCUs, since resource consumption is predominantly determined by the number of NCUs and the data bit-width, which are decoupled from the specific topology. To better support the proposed method and minimize RAM block usage, a weight segmentation and recombination method is introduced, accompanied by a detailed explanation of the access order. Moreover, a performance model is developed to facilitate the parameter selection process. Experimental results on an FPGA development board demonstrate that the proposed method achieves lower resource consumption, higher resource utilization, and greater generality compared to other methods.


Introduction
The artificial neural network has wide applications in image recognition, prediction, control engineering, and other fields due to its advantages, such as strong nonlinear mapping ability, model independence, adaptability, and learning capability [1]. There are various forms of neural networks, with the convolutional neural network receiving significant attention in recent years, along with the Transformer-based architecture following the rise of ChatGPT. However, in 2021, Google's MLP-Mixer rekindled interest in the multilayer perceptron (MLP) [2]. The performance of MLP-Mixer rivals that of convolutional or Transformer-based architectures, indicating the untapped potential of MLP [3,4].
Furthermore, MLP remains the most commonly utilized neural network in practical engineering applications, particularly in the field of intelligent control. Its applications encompass modeling, decoupling, parameter estimation, fitting, diagnosis [5][6][7][8], etc. When it comes to implementation platforms, embedded systems are often selected, yet executing code sequentially on chips such as digital signal processors (DSPs) can be time-consuming. In contrast, field-programmable gate arrays (FPGAs) are better suited for neural network implementation due to their parallel structure, which accelerates the calculation process [9]. Consequently, FPGA has emerged as a popular platform for MLP implementation [10][11][12].
Nonetheless, the computation of neural networks implemented on FPGAs can exhaust resources, especially during the learning process, and the limited resources provided by FPGAs necessitate high resource utilization in the implementation method [13]. Moreover, the topology of neural networks varies across different applications. Since there is no universal formula for topology selection, the trial-and-error method is still the most commonly used strategy [14]. Consequently, the implementation method should possess generality across different topologies and resources. Additionally, the resource consumption of common implementation methods is usually correlated with the concrete topology, which constrains the application range of these methods, especially when neural networks with large topologies must be implemented on limited FPGA resources. Several scalable architectures have been proposed, such as DLAU [15], DeepX [16], FP-DNN [17], etc. However, these accelerators map all operations from different network layers to the same hardware unit, resulting in the same degree of calculation parallelism for different neural networks. Therefore, these architectures cannot meet the various parallelism requirements of different neural networks or make full use of the reconfigurable characteristics of FPGA [18,19], which causes inadequate utilization of hardware resources, especially for shallow neural networks.
Therefore, to address these challenges, this paper proposes an efficient on-chip learning method for MLP. Instead of solely performing off-chip learning for inference calculation [20], or solely acting as a computer assistant to accelerate part of the neural network calculation [21], the proposed method deploys the entire learning process of neural networks on a single FPGA, which is more friendly for embedded systems. The main contributions of this work are outlined below.
Firstly, an architecture for neural network hardware implementation with a configurable neuron calculation unit (NCU) is proposed; the parallelism-tunable architecture can meet the parallelism requirements of different applications and increase the resource utilization rate. The proposed method is also highly parameterized to better adapt to different applications, allowing for flexible configuration of the topology, data bit-width, and other parameters.
Secondly, a weight adaptive adjustment strategy is proposed to better support the calculation of the proposed method. The weights are segmented and recombined according to the number of NCUs, thereby reducing the utilization of random-access memory (RAM) blocks.
Thirdly, multiple strategies are used to reduce resource consumption and calculation time. The proposed method adopts a "folded" systolic array wherein data is received serially while NCUs operate in parallel [22]. With this structure, only one memory is needed to store the output or sensitivity, and only one activation function module is utilized throughout the neural network, resulting in a further reduction of resource consumption. Meanwhile, the end time of each calculation stage is optimized to save computation cycles, ensuring fast calculations through a compact time sequence arrangement.
Finally, a performance model of the proposed method is constructed, which can evaluate the effect of different parameters beforehand to accelerate the parameter selection process.
The rest of this paper is organized as follows. Section 2 provides an introduction to the related works. The learning principle of neural networks is presented in Section 3. Section 4 extensively elaborates on the design of the proposed method. The strategy of weight storage and access is explained in Section 5. In Section 6, the performance model of the proposed method is described. The verification and comparison results of the proposed method are shown in Section 7. We then conclude our work in Section 8.

Related Works
Numerous studies have been conducted on the implementation of MLP in FPGA. In 1999, Izeboudjen proposed the parametric programming of neural networks using VHDL code [23]. The structure of the proposed method was parallel at the neuron level and serial at the layer level, allowing for the easy realization of different network topologies by modifying parameters. However, this work only supports the inference calculation of the neural network. In 2007, Izeboudjen extended the idea of parameterization to online training of neural networks [24], which adjusted topology by copying or deleting neural cores. Nevertheless, the number of layers in this work is not adjustable and the configuration steps are complex. To overcome these limitations, a more flexible FPGA implementation method for neural networks was proposed in [25]. This highly parameterized method stands out for the convenience it offers for parameterization. Moreover, the calculation speed achieved by this method is sufficient to meet the requirements as long as the time sequences are properly arranged. In addition, by performing calculations on one layer at a time, the multiplexing method effectively avoids the issue of backward locking. Consequently, the multiplexing method is selected as the foundation of our research.

The Principle of the MLP
The BP algorithm, which is frequently utilized for training MLP [13], is a gradient-descent-based learning procedure that approximates the expected output by continuously adjusting the weight values according to the error gradient. The calculation process of MLP with the BP learning algorithm can be divided into three phases: the forward calculation phase, the backward calculation phase, and the update phase. The forward calculation phase involves the forward inference calculation, where the input signal is propagated through the network in the forward direction until it reaches the output. The backward calculation phase is responsible for the error backpropagation calculation, during which the error signal propagates through the network in the backward direction until it reaches the input. The update phase focuses on the weight update calculation, where the weights are adjusted according to the error gradient obtained in the first two phases.

In Figure 1, V_1 ~ V_m represent the inputs, which are also the outputs of the previous layer. W_ij represents the weight value from neuron i in the previous layer to neuron j in the current layer. b_1 ~ b_n represent biases. u_1 ~ u_n represent the outputs of multiply-accumulate (MAC), which can be obtained as follows:

u_j = Σ_{i=1}^{m} W_ij · V_i + b_j (1)

The output of MAC is the input of the activation function g, where g is commonly chosen as the hyperbolic tangent function:

V_j = g(u_j) = (e^{u_j} − e^{−u_j}) / (e^{u_j} + e^{−u_j}) (2)

The structure of the output layer is slightly different from Figure 1. The final output is correlated just linearly with the output of MAC:

Out = C · u (3)

where C = 1 in this paper. Assuming the expectation of the network is d, the error is as follows:

e = d − Out (4)

Electronics 2023, 12, 3607

In the learning phase, the algorithm adjusts the weight to minimize the mean square error, whose value is as follows:

E = (1/2) · e² (5)

The BP algorithm adjusts the weight with a certain step along the opposite direction of the error gradient:

ΔW_ij = −η · ∂E/∂W_ij = η · S_j · V_i (6)

where η is the adjustment step and S is the sensitivity, which can be obtained by the chain rule:

S_j = −∂E/∂u_j = g′(u_j) · Σ_l Ŝ_l · W_jl (7)

where Ŝ represents the sensitivity value of the previous layer (in the backward direction) and g′ is the derivative of the activation function, which can be obtained as follows:

g′(u_j) = 1 − g(u_j)² = 1 − V_j² (8)

The sensitivity value in the last layer is:

S = C · e = e (9)

The update of the weight is shown as follows:

W_ij(t + 1) = W_ij(t) + η · S_j · V_i (10)
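The three calculation phases described above can be cross-checked with a short plain-Python sketch of one training cycle for a single-hidden-layer MLP with a tanh hidden layer and a linear output (C = 1). The topology, learning rate, and variable names here are illustrative, not taken from the paper's hardware design.

```python
import math

def train_step(V0, d, W1, b1, W2, b2=0.0, eta=0.2):
    """One BP cycle (forward, backward, update) for a 1-hidden-layer MLP.
    W1[j][i] is the weight from input i to hidden neuron j; W2[j] connects
    hidden neuron j to the single linear output (C = 1)."""
    # Forward calculation phase
    u1 = [sum(W1[j][i] * V0[i] for i in range(len(V0))) + b1[j]
          for j in range(len(b1))]
    V1 = [math.tanh(u) for u in u1]              # hyperbolic tangent activation
    out = sum(W2[j] * V1[j] for j in range(len(V1))) + b2   # linear output
    # Backward calculation phase
    e = d - out                                  # error against expectation d
    S_out = e                                    # last-layer sensitivity (C = 1)
    S1 = [(1 - V1[j] ** 2) * S_out * W2[j]       # chain rule with g' = 1 - V^2
          for j in range(len(V1))]
    # Update phase: step eta along the negative error gradient
    for j in range(len(V1)):
        for i in range(len(V0)):
            W1[j][i] += eta * S1[j] * V0[i]
        b1[j] += eta * S1[j]
        W2[j] += eta * S_out * V1[j]
    return out, e
```

Repeated calls on a fixed sample drive the error magnitude toward zero, mirroring one on-chip learning loop per sample.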

Overview of the Proposed Method
The proposed neuron multiplexing method is an improvement of the layer multiplexing method, and the diagram of the traditional layer multiplexing method is shown in Figure 2a. Since the layer with the maximum number of neurons is selected as the multiplexing layer in the traditional layer multiplexing method, its resource consumption is still related to the concrete topology, which limits the applicability of the method. In addition, the layer multiplexing method has low resource utilization when the neuron number varies greatly across layers. For example, in Figure 3, a neural network with a topology of 3-4-2 is calculated according to the traditional multiplexing method; the part of the neural network that is executing calculations is represented by solid lines, and the rest by shaded lines. The layer number in the figure represents the layer that is executing calculations. The calculation resources are configured according to the second layer, which has the largest neuron number of four, and it can be observed that two NCUs are idle during the calculation of the last layer, which causes low resource utilization. To address these issues, this paper proposes a neuron multiplexing method based on the layer multiplexing method, with the goal of achieving better resource utilization and generality. The proposed architecture is composed of configurable NCUs, each of which comprises the resources needed to complete the calculation of one neuron, and the calculation of any topology can be realized through multiplexing NCUs. Figure 2b provides a diagram illustrating the neuron multiplexing method.
Upon observing Figure 2, it can be noted that the layer multiplexing algorithm can be viewed as a special case of the proposed method, wherein the number of NCUs is set to the maximum neuron number among the layers. If we consider layer multiplexing as the folding of neural networks in the horizontal dimension, the proposed neuron multiplexing method can be seen as an even further folding of the neural networks in the vertical dimension.
By employing the proposed method, it is possible to decouple the topology from the resources and enhance resource utilization. For example, if the case depicted in Figure 3 is processed using the proposed method and the number of NCUs is set to two, the calculation process is demonstrated in Figure 4. The stage number in the figure represents the executing stage within one layer of the proposed method.

In Figure 4 (stages labeled "Layer 2, Stage 1", "Layer 2, Stage 2", and "Layer 3"), the number of NCUs is smaller than four, resulting in less resource consumption, and the NCUs are occupied all the time, which increases resource utilization. Since the calculation of layers that have more neurons than the number of configured NCUs is divided into several stages, the calculation cycles will also increase slightly. The detailed analysis of performance will be shown in Section 6.
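The staging and utilization argument can be made concrete with a small scheduling sketch: each non-input layer is split into ceil(N_i / k) stages of at most k configured NCUs, and utilization is the ratio of busy NCU-stage slots to total slots. The function below is an illustration of this bookkeeping, not part of the paper's design.

```python
from math import ceil

def stage_schedule(topology, k):
    """Split each non-input layer into ceil(n / k) stages of at most k NCUs.
    Returns (stages, busy_slots, total_slots); utilization = busy / total."""
    stages = []
    for layer, n in enumerate(topology[1:], start=2):   # layer 1 is the input
        for s in range(ceil(n / k)):
            busy = min(k, n - s * k)    # NCUs with real work in this stage
            stages.append((layer, s + 1, busy))
    busy_slots = sum(b for _, _, b in stages)
    return stages, busy_slots, len(stages) * k
```

For the 3-4-2 network, k = 2 yields three fully busy stages (utilization 6/6 = 100%), whereas configuring k = 4 as in layer multiplexing leaves two NCUs idle in the last layer (6/8 = 75%), matching the situations in Figures 3 and 4.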
The subsequent subsections will introduce the detailed implementation design of the proposed method. To provide a clear introduction, we denote the neuron number of the previous layer as N_{i−1}, the neuron number of the current layer as N_i, and the number of NCUs as k.

The Top-Level Module
The top-level module mainly contains the interfaces with the outside, which is shown in Figure 5.

At the top-level module, the samples, initial weights, and expectations are inputs from external sources, which can be acquired through the retrieval of corresponding tables. The inner modules generate the addresses of these tables, which are then output from the top-level module. The "TrainFlag" signal is utilized to determine whether to execute the inference-only calculation or the entire training process. This determination is manifested through the state machine depicted within the dashed rectangle; when the signal on an arrow is valid, the state machine performs the corresponding transition. If "TrainFlag" is zero, the state machine will transition directly to "State = 1" upon completion of "State = 2"; otherwise, it will proceed to "State = 3" and carry out the entire calculation. (Refer to Table 1 for a description of the state machine.)
The "NewCycleStart" signal is the start signal of the new cycle, which is determined by the state machine and the enable signal from outside. "Out" and "Error" signals are the output and error after inference is completed.
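Since Table 1 is not reproduced here, the sketch below assumes a plausible state assignment (1 = idle, 2 = forward, 3 = backward, 4 = update); only the "TrainFlag"-controlled branch out of State 2 is taken directly from the description above, and the remaining transitions are hypothetical.

```python
def next_state(state, train_flag, new_cycle_start, stage_done):
    """Hypothetical transition function for the top-level state machine.
    Assumed meanings: 1 idle, 2 forward, 3 backward, 4 update."""
    if state == 1:
        return 2 if new_cycle_start else 1      # "NewCycleStart" launches a cycle
    if state == 2 and stage_done:
        return 3 if train_flag else 1           # TrainFlag = 0: inference only
    if state == 3 and stage_done:
        return 4                                # backward done -> update
    if state == 4 and stage_done:
        return 1                                # update done -> back to idle
    return state                                # otherwise hold the state
```

With this encoding, an inference-only pass visits 1 → 2 → 1, while a full training pass visits 1 → 2 → 3 → 4 → 1.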

Multipliers and Weight Storage

To make the design clearer, the forward calculation phase, backward calculation phase, and update phase correspond to three independent modules, namely, forward calculation block, backward calculation block, and update block. In all three phases, multipliers and weights are utilized and multiplexed to minimize resource usage. The interconnections between the multipliers, weight storage, and other blocks can be found in Figure 6.
In Figure 6, thick lines represent signal clusters composed of multiple signals. Considering that the multipliers and weights serve as public resources with numerous connections to other modules, they are not designed as separate modules and the connections are not present in the figure but are represented by dashed lines instead. All signal names starting with "F", "B", and "U" indicate signals originating from the forward calculation block, backward calculation block, and update block, respectively.
The inputs of the multipliers switch based on the state signal, enabling multiplexing of the multipliers during different phases. With the exception of the multiplication within the activation function module and the constant multiplication within the update module, all other multiplications are accomplished using the multipliers depicted in Figure 6. As a result, the total number of multipliers utilized throughout the algorithm is only two more than the number of NCUs.
The output of the multipliers is truncated to prevent excessive resource consumption in accordance with the configured data bit-width. To reduce the critical path delay, two cycles of delay are introduced after the output. Additionally, the multipliers offer flexible configuration options, allowing them to be mapped to either DSP slices or LUT slices of the FPGA, with the former being the default mapping choice.
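The post-multiplier truncation can be modeled in a few lines. The parameter names and the saturating clamp are illustrative assumptions, since the text above only states that products are truncated to the configured data bit-width.

```python
def fx_mul_trunc(a, b, frac_bits, total_bits):
    """Multiply two signed fixed-point numbers held as integers with frac_bits
    fractional bits, drop the extra fractional bits, and clamp the result to
    the configured total_bits width. Parameter names are illustrative."""
    prod = (a * b) >> frac_bits                  # discard fractional growth
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, prod))                # saturate to the data bit-width
```

For example, with 8 fractional bits and a 16-bit word, 0.5 × 0.5 is fx_mul_trunc(128, 128, 8, 16) = 64, i.e., 0.25 in the same fixed-point scale.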
The "Weight Rd&Wr" module consists of several dual-port RAMs to store weights, which can support simultaneous reading and writing. The module is automatically mapped to distributed RAM or block RAM according to the storage size. The read addresses of the RAMs are switched according to the state signal, while the write-relevant signals are switched according to the end signal of weight initialization, as the writing of weights is only required in the weight initialization or update phase. The weight initialization module is used to generate the reading and writing-related signals of weights in the weight initialization phase.

Forward Calculation Block
The forward calculation block is responsible for performing the forward inference calculation of the neural network, which corresponds to Equations (1) through (4). The diagram of this block is depicted in Figure 7.

The start signal of the block derives from the end signal of the previous stage, or from the external start signal of the new cycle for the first stage of the first layer. When the start signal arrives, the status of layers and stages will be updated first. A counter named "Counter1" starts at the same time, which increases from 0 to N_{i−1}. The address offset of the weight and the output V can be obtained by looking up the corresponding offset table according to the current status of layers and stages. The address of the weight and V can then be obtained by adding the counter and the offset value.
The read address of weight "F_W_RdAddr" will be output to Figure 6 after delays to access the RAM of weight, and the accessed weight "Wout" is input to the multipliers in Figure 6. The read address of V is input to the RAM of V through a selector, as V is also used in the backward and update phases. The accessed value of V is also input to the multipliers in Figure 6, and then multiple multiplication results "MulOut" are obtained in parallel and input to the accumulation module through "In2". The signal flow of MAC is shown in Figure 8.

Figure 7. Diagram of the forward calculation block.
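The "offset table plus counter" addressing described above can be sketched as follows. The storage layout here is a hypothetical one chosen for illustration (each layer/stage occupies a block of N_{i−1} consecutive words per NCU RAM, swept by "Counter1"); the actual weight segmentation and access order are detailed in Section 5.

```python
from math import ceil

def weight_addr_offsets(topology, k):
    """Hypothetical per-NCU weight layout: each (layer, stage) pair occupies a
    block of N_{i-1} consecutive words, so a weight's read address is simply
    offsets[(layer, stage)] + counter1."""
    offsets, addr = {}, 0
    for li in range(1, len(topology)):
        n_prev, n_cur = topology[li - 1], topology[li]
        for s in range(ceil(n_cur / k)):
            offsets[(li + 1, s + 1)] = addr      # keyed by (layer, stage)
            addr += n_prev                       # Counter1 sweeps one row
    return offsets
```

For topology 3-4-2 with k = 2 this layout gives offsets {(2, 1): 0, (2, 2): 3, (3, 1): 6}, so, e.g., stage 2 of layer 2 reads addresses 3, 4, 5 as the counter advances.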
Figure 8. Signal flow of MAC.
The architecture illustrated in Figure 8 can be regarded as a "folded" systolic array [22], which exhibits similarity to a conventional systolic array when the Multiply-Accumulate units (MACs) of the architecture are vertically unfolded. The MAC is comprised of the multipliers depicted in Figure 6 and the accumulation module within the forward calculation block. The input traverses through the MACs sequentially, effectively supporting the storage of a single value V in one RAM for each clock cycle.
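A behavioral sketch of this "folded" systolic chain follows: the serial input V ripples through the chain one MAC per clock, so MAC j sees V[i] at clock i + j and accumulates it against its own weight row. This is an idealized model of the dataflow (no pipeline registers for weights are modeled), not the exact RTL.

```python
def folded_systolic_mac(V, W_rows):
    """Behavioral model of k chained MACs fed by one serial stream V.
    Returns the k accumulated dot products after the pipeline drains."""
    k, n = len(W_rows), len(V)
    acc = [0.0] * k
    for clk in range(n + k - 1):                 # k - 1 extra cycles to drain
        for j in range(k):
            i = clk - j                          # element currently at MAC j
            if 0 <= i < n:
                acc[j] += W_rows[j][i] * V[i]
    return acc
```

The result equals the per-row dot products, but only one element of V is read per clock, which is why a single RAM with one read port suffices for V.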
The values of "MacOut" are converted into serials through Selector2 to reduce the usage of the activation function module and facilitate the storage of V. The design of the activation function is our previous work, so it will be only briefly introduced here. The The start signal of the block derives from the end signal of the previous stage, or from the external start signal of the new cycle for the first stage of the first layer. When the start signal arrives, the status of layers and stages will be updated first. A counter named "Counter1" will start at the same time, which increases from 0 to Ni−1. The address offset of weight and output V can be obtained by looking up the corresponding offset table according to the current status of layers and stages. The address of weight and V can be then obtained by adding the counter and the offset value.
The read address of weight "F_W_RdAddr" will be output to Figure 6 after delays to access the RAM of weight, and the accessed weight "Wout" is input to the multipliers in Figure 6. The read address of V is input to the RAM of V through a selector, as the V is also used in the backward and update phases. The accessed value of V is also input to the multipliers in Figure 6, and then multiple multiplication results "MulOut" will be obtained in parallel and are inputted to the accumulation module through "In2". The signal flow of MAC is shown in Figure 8. The architecture illustrated in Figure 8 can be regarded as a "folded" systolic array [22], which exhibits similarity to a conventional systolic array when the Multiply-Accumulate units (MACs) of the architecture are vertically unfolded. The MAC is comprised of the multipliers depicted in Figure 6 and the accumulation module within the forward calculation block. The input traverses through the MACs sequentially, effectively supporting the storage of a single value V in one RAM for each clock cycle.
The values of "MacOut" are converted into a serial stream through Selector2 to reduce the usage of the activation function module and facilitate the storage of V. The design of the activation function is our previous work, so it is only briefly introduced here. The activation function is realized through a hybrid method: its fast-changing region is approximated by a lookup table with interpolation [25], and its slow-changing region is realized using the range-addressable lookup table method [37]. The optimal data bit-width of the method is selected automatically according to the expected accuracy. Compared with other methods, the proposed approximation method saves more hardware resources under the same expected accuracy. Since only one activation function module is used for the entire neural network, its accuracy can be appropriately increased to achieve a more accurate calculation of the neural network.
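The hybrid scheme can be illustrated with a software sketch for a sigmoid activation; the region boundaries, table sizes, function name, and floating-point (rather than fixed-point) arithmetic below are placeholder choices, not the parameters of the actual module:

```python
import math

SPLIT, TAIL = 4.0, 8.0     # region boundary and saturation point (placeholder values)
FAST_N, SLOW_N = 256, 32   # table sizes (placeholder values)

# Fast-changing region [0, SPLIT): fine-grained table used with linear interpolation.
fast_lut = [1.0 / (1.0 + math.exp(-SPLIT * i / FAST_N)) for i in range(FAST_N + 1)]
# Slow-changing region [SPLIT, TAIL): one coarse entry per address range.
slow_lut = [1.0 / (1.0 + math.exp(-(SPLIT + (TAIL - SPLIT) * (i + 0.5) / SLOW_N)))
            for i in range(SLOW_N)]

def sigmoid_approx(x: float) -> float:
    if x < 0.0:                                   # use sigmoid(-x) = 1 - sigmoid(x)
        return 1.0 - sigmoid_approx(-x)
    if x >= TAIL:                                 # saturated tail
        return 1.0
    if x >= SPLIT:                                # range-addressable lookup, no interpolation
        return slow_lut[int((x - SPLIT) / (TAIL - SPLIT) * SLOW_N)]
    pos = x / SPLIT * FAST_N                      # lookup table with interpolation
    i = int(pos)
    frac = pos - i
    return fast_lut[i] + (fast_lut[i + 1] - fast_lut[i]) * frac
```

The fast region spends storage on accuracy where the curve bends sharply, while the tails get by with a few coarse range entries, which is why the hybrid split saves resources for a given accuracy target.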
"Counter2" starts a few cycles after "Counter1" ends; it not only serves as the selection signal of "Selector2" but is also added to the offset address of V to constitute the write address. The end signal of "Counter1" is also input to the "End Determine" module to determine the end time of both the stages and the forward calculation phase. Intuitively, the calculation of each stage should not end until the storage of V is finished. However, the end time can be advanced because the storage can be executed simultaneously with the calculation of the next stage. The determination of the end time must avoid storage conflicts between different stages, which means the storage of the output in the current stage should be finished before the storage in the next stage starts. The resulting end time in the forward calculation phase is shown in Figure 9a.
In Figure 9a, "Counter3" starts when the MAC operation finishes and ends when the stage ends. If the next stage is in a new layer, the inputs are the outputs of the current layer plus an additional bias. Since the number of cycles spent on the MAC is exactly the number of inputs, and the number of cycles needed for storage is one more than the number of outputs in the current stage (one for the bias), the cycles spent on the MAC operation in the next stage are not fewer than the cycles needed for storage in the current stage. Considering that the inputs in the next layer are obtained through RAM access, the first output must be stored before the new stage starts. Therefore, the end time can be defined at the moment the first output is stored in the RAM, i.e., Counter3 = 3.
If the next stage is not in a new layer, the inputs of the next stage are the outputs of the previous layer plus an additional bias. The number of cycles needed for storage in each stage is at most k + 1, since the number of NCUs is k and the extra one is reserved for bias storage. When Ni−1 + 1 ≥ k + 1, the cycles spent on the MAC in the next stage are still sufficient for the current outputs to be stored in RAM. Since the outputs of the current stage are not used in the next stage, the end time can be set at the moment the MAC finishes, i.e., Counter3 = 1. When Ni−1 < k, the cycles spent on the MAC in the next stage are not sufficient for the current outputs to be stored in RAM, so the end time has to be delayed until Counter3 = k − Ni−1. In this way, the cycles spent on the MAC are just enough for the leftover outputs to be stored in RAM, avoiding storage conflicts in the next stage. The "StageEnd" signal in the last stage of the last layer can also be used as the end signal of the forward calculation phase.
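The end-time rule of Figure 9a can be condensed into a small behavioral model (a sketch of the rule, not the RTL; the function name is ours):

```python
def stage_end_counter3(next_stage_new_layer: bool, n_prev: int, k: int) -> int:
    """Counter3 value at which a forward-phase stage may end.

    next_stage_new_layer: True if the next stage begins a new layer
    n_prev: N_{i-1}, the input count of the current layer (excluding bias)
    k: the number of NCUs
    """
    if next_stage_new_layer:
        # wait until the first output of the stage is stored in RAM
        return 3
    # end at MAC completion (Counter3 = 1), or later when the storage of the
    # leftover outputs would collide with the next stage (N_{i-1} < k)
    return max(1, k - n_prev)
```

For example, with k = 3 NCUs and Ni−1 = 1, a stage whose successor stays in the same layer must linger until Counter3 = 2 so that the remaining outputs finish storing.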

Backward Calculation Block
The backward calculation block is used to calculate the sensitivity of the neural network, which corresponds to (7)~(9). The diagram of the backward calculation block is shown in Figure 10.
Since some components of the backward calculation block are similar to those of the forward calculation block, they are not reintroduced. The "folded" systolic array depicted in Figure 8 is also implemented in the backward calculation block, with a different weight access order. This approach effectively avoids simultaneous access to the same RAM. The details of the access order are explained in Section 5. To accommodate the cross-access of different RAMs required in the backward calculation phase, the "Weight Addr Cal" and "WTrans" modules are necessary to determine the appropriate weight order.
Based on Equations (7) and (8), the multipliers in the backward calculation phase must perform several kinds of multiplications rather than only the multiplication of the MAC operation. Consequently, the inputs of the multipliers are switched according to the inner state machine, as illustrated by the dashed rectangle in Figure 10. In this figure, "BState" represents the signal of the internal state machine, "MacEnd" indicates the completion of the MAC operation, "DerivativeEnd" signifies the completion of the derivative calculation, and "MulEnd" marks the completion of all multiplications in the current stage. As a result, the multipliers are multiplexed to serve multiple functions, thereby conserving resources.
The end time of each stage in the backward calculation block is shown in Figure 9b and is similar to that in the forward calculation block. The only difference is that there is no activation function calculation in the backward calculation block, so the storage starts immediately after the completion of the MAC. Therefore, the first sensitivity value is stored in the RAM at the moment the MAC finishes, i.e., Counter3 = 1.
Unlike in the forward calculation phase, the "StageEnd" signal in the last stage of the last layer cannot be directly used as the end signal of the backward calculation phase. The end time determination of the backward calculation phase is shown in Figure 9c, where "Counter4" starts when the stage ends and ends when the backward calculation phase ends. The "StageEnd" signal in the backward calculation phase coincides with the storage of the first sensitivity value in RAM, but the access of V may not have ended at this moment, as the value of V is used to calculate the derivative of the activation function in (8). Therefore, if the backward calculation phase ended at this moment, the accesses of V in the backward calculation and update phases might conflict. Since the "StageEnd" signal is seven clock cycles later than the first access of V, the end signal of the backward calculation phase can be set to the "StageEnd" signal if "ActiveLast" is less than or equal to seven, where "ActiveLast" is the number of neurons active in the calculation in the last stage of the last layer. Otherwise, the end signal of the backward calculation phase should be ActiveNum − 7 cycles later than the "StageEnd" signal.
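Reading "ActiveNum" as the same quantity as "ActiveLast", the end-of-phase rule can be sketched as follows (a behavioral model under that assumption, not the hardware):

```python
def backward_phase_end_delay(active_last: int) -> int:
    """Cycles after "StageEnd" before the backward phase may safely end.

    active_last: number of neurons active in the last stage of the last
    layer ("ActiveLast"). "StageEnd" trails the first access of V by seven
    cycles, so up to seven accesses of V are already covered by the wait.
    """
    return 0 if active_last <= 7 else active_last - 7
```

The delay guarantees that every read of V needed for the derivative in (8) completes before the update phase starts touching the same RAM.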

Update Block
The update block is mainly used for the update of the weights, which corresponds to (6) and (10). The diagram of the update block is shown in Figure 11.

Since the values of V and S are all stored in RAMs and only one value is accessed per clock cycle, the value of S is latched in registers after every access to support parallel multiplication, and the constant multiplication in (6) is performed after the access of S. The signal flow of the multiplication is shown in Figure 12, where the inputs flow across the multipliers in sequence, executing the parallel multiplication with ηS, which makes full use of the multipliers. The weight adjustments obtained after the multiplication are added to the original weights to form the new values, which are then written back to storage. Once the input of V and S completes in the current stage, the input of the next stage can be executed immediately, as there is no accumulation or selection operation in the update block, unlike in the forward or backward calculation block. As a result, the end of a stage can be set at the moment the access of V and S is completed. Furthermore, the "StageEnd" signal in the last stage of the final layer can serve as the end signal for the update phase.
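At the algorithm level, one update-phase stage performs the following (a sketch of the update in (6) and (10) for a stage of k neurons; not cycle-accurate, and the fixed-point details are omitted):

```python
def update_stage(weights, v_values, s_values, eta):
    """Update the weight rows of one stage: w <- w + (eta * s) * v.

    weights:  k rows of the weight matrix handled by this stage
    v_values: the layer inputs plus the bias input (length m + 1)
    s_values: the k sensitivities of this stage, latched in registers
    eta:      the learning rate
    """
    eta_s = [eta * s for s in s_values]       # constant multiplication in (6)
    for w_row, es in zip(weights, eta_s):
        for i, v in enumerate(v_values):      # V flows across the multipliers
            w_row[i] += es * v                # adjustment added to the old weight
    return weights
```

Latching ηS in a register lets every V value read from RAM be multiplied against all k sensitivities in parallel, matching the one-value-per-cycle RAM access.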

Weight Storage and Access
To better adapt to the proposed architecture and further reduce the usage of RAM blocks, a new weight storage and access method is proposed and introduced in this section. For clarity, set m = Ni−1, n = Ni, γ = floor(m/k), and λ = floor(n/k), where floor means rounding to the nearest integer less than or equal to the element.

Weight Storage
The number of RAMs should be sufficient to support the simultaneous access of the weights, and the number of weights in one RAM should be as large as possible to reduce the usage of RAM blocks. The maximum number of weights accessed simultaneously is k in the proposed method, and the weights are therefore segmented and recombined to be stored in k RAMs.
The weight matrices in the forward and backward calculation phases are shown in Figure 13a,b, where the number in each square is the subscript of the variable, and the bias and its corresponding weight are placed in the dashed squares since they are not used in the backward calculation phase. Squares of the same color indicate weights stored in the same RAM.

The weights are stored in "RAM_W1~k" according to the access order of the weights in the forward calculation. First, the weights in the first k rows of Figure 13a are stored in "RAM_W1~k" in sequence, and then the weights in the (k + 1)-th row are stored starting from "RAM_W1" again. The weights in the subsequent rows follow the same storage rule. The storage order within one layer is shown in Figure 13c.
In addition, the weights belonging to the same neuron position in different layers are recombined in the order of the layers. For example, the storage order of the weights in a network of topology 3-4-5-2 with three NCUs is shown in Figure 14. It can be observed in Figure 14 that the start addresses of the weights in different layers vary; hence, the address offsets of the different layers are tabulated beforehand. Moreover, the weights are initialized individually following the order of "RAM_W1~k", which differs from the weight matrix format. Therefore, an m-file script that converts the weight matrix into the proposed form is provided, which is convenient for initialization.
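The segmentation and recombination rule can be sketched in software. The helper name is ours, and padding the shorter RAMs at each layer boundary is our assumption about how the per-layer offsets are aligned:

```python
def pack_weights(layer_mats, k):
    """Pack per-layer weight matrices into k RAM images.

    layer_mats: one matrix per layer, one row per neuron, with the bias
    weight included in each row.
    Returns (rams, layer_offsets): rams[r] models "RAM_W(r+1)", and
    layer_offsets[l] is the start address of layer l in every RAM.
    """
    rams = [[] for _ in range(k)]
    layer_offsets = []
    for mat in layer_mats:
        start = max(len(r) for r in rams)
        for r in rams:                        # align every RAM to the layer start
            r.extend([0.0] * (start - len(r)))
        layer_offsets.append(start)
        for row_idx, row in enumerate(mat):   # row j goes to RAM_W(j mod k + 1)
            rams[row_idx % k].extend(row)
    return rams, layer_offsets
```

Under these assumptions, the 3-4-5-2 topology with k = 3 yields layer offsets 0, 8, and 18, since RAM_W1 holds two rows of each of the first two layers.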


Weight Access
Since the weights are stored according to the access order of the forward calculation, the access order in the forward calculation phase is simply the storage sequence with a delay of several cycles, where the delays drive the flow of the systolic array. The access order is shown in Figure 15a. Conversely, the weight access in the backward calculation phase is relatively complex on account of the cross-access of different RAMs, as shown in Figure 15b.

In Figure 15b, the first row in stage 1 is W11~W1n: W11 is read from "RAMW_1" at "Clk1", W12 from "RAMW_2" at "Clk2", and W1k from "RAMW_k" at "Clkk". Then, at "Clk(k + 1)", W1(k+1) is read from "RAMW_1" again, and the following accesses obey the same rule. In the next stage, the access starts from W(k+1)1 and follows the same order as in stage 1, as do the subsequent stages. This access order not only satisfies the simultaneous access of k RAMs but also avoids simultaneous accesses within one RAM.
The relevant pseudo-code to obtain the read address of the weights in the backward calculation phase is shown in Algorithm 1, where "ActiveNum" is the number of NCUs actually executed in the current stage, and "LayerOffset" is the offset of the current layer, obtained by looking up the offset table of the layers.

Algorithm 1 Read address generation of the weights in the backward calculation phase
1: …
2: Param: m, n, k
3: Output: RdAddr
4: if Start then
5: En ← 1
6: end
7: if En then
8: if Count == k − 1 then
9: if Offset ≥ n then
10: Offset ← 0
11: Count ← 0
12: En ← 0
13: else
14: Offset ← Offset + m + 1
15: Count ← 0
16: end
17: else
…
In the pseudo-code, the "Addend" is bounded by min(Count, ActiveNum) because "ActiveNum" may be less than k in the final stage.
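The conflict-freedom of the cross-access order can be checked with a small model: NCU j lags the first NCU by j cycles, and column c of the backward matrix resides in RAM (c mod k) + 1, so the k columns touched in any clock are consecutive and hit distinct RAMs. This is a verification sketch under our indexing assumptions, not the address generator itself:

```python
def backward_read_schedule(n_cols, k):
    """Per-clock weight reads of one backward stage as (ncu, column, ram).

    n_cols: number of weights each NCU reads in the stage
    k:      number of NCUs (and weight RAMs)
    """
    schedule = []
    for clk in range(n_cols + k - 1):
        reads = []
        for ncu in range(k):
            col = clk - ncu                   # skew: NCU j starts j cycles later
            if 0 <= col < n_cols:
                reads.append((ncu, col, col % k))
        rams = [ram for _, _, ram in reads]
        # at most one access per RAM in any clock cycle
        assert len(rams) == len(set(rams))
        schedule.append(reads)
    return schedule
```

Because any k consecutive column indices cover k distinct residues modulo k, the assertion never fires, which is exactly the property the access order is designed to provide.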
Additionally, the delays shown in Figure 15 appear to introduce extra calculation cycles. However, the overall calculation cycle count is not increased, since the parallel outputs of the Multiply-Accumulate (MAC) units are converted into a serial data stream. Taking the calculation of the sensitivity in the backward calculation phase as an example, the timing analysis is shown in Figure 16, where "SCalTime" is the time used by the MAC and Tk is the time when the storage starts.
In Figure 16a, although all the MACs finish at time Tk, the storage completes at Tk+3, since the storage is serial. In Figure 16b, the calculation of Si completes at Ti+k−1, which is i − 1 cycles later than Tk due to the input delay. However, the storage still completes at Tk+3, because the storage moment is exactly the calculation completion time of Si. As a result, the delay arrangement does not increase the overall calculation time, with only a slight increase in register usage.
Through the above strategy of weight storage and access, only k RAM modules are needed, which maximizes support for parallel computing while reducing the usage of RAM blocks.
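The weight-to-RAM mapping underlying this strategy can be sketched in Python. The round-robin assignment of output neurons to RAM modules below is an illustrative assumption modeled on the pseudo-code parameters m, n, and k, not a verbatim transcription of the design:

```python
def segment_weights(weights, k):
    """Distribute an m x n weight matrix across k RAM modules.

    Each RAM r holds the weight columns of the output neurons assigned
    to NCU r (round-robin, assumed), so all k NCUs can read their
    weights in the same cycle. Illustrative sketch only.
    """
    m = len(weights)        # inputs per neuron (rows)
    n = len(weights[0])     # number of output neurons (columns)
    rams = [[] for _ in range(k)]
    for j in range(n):      # output neuron j -> RAM (j % k)
        for i in range(m):
            rams[j % k].append(weights[i][j])
    return rams
```

For a 2 x 4 weight matrix and k = 2, RAM 0 receives the columns of neurons 0 and 2 and RAM 1 those of neurons 1 and 3, so each of the two NCUs streams its own weights without port conflicts.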

Performance Model
The proposed method is characterized by a highly parameterized design, featuring a configurable topology, data bit-width, and NCU number, thus enabling its wide application. In order to expedite the parameter selection process, a performance model is established for evaluating the proposed method. The performance index encompasses essential metrics, including calculation cycles, resource consumption, and maximum frequency, among others.
The calculation cycles of each layer during the forward calculation phase can be expressed as follows:

F_Cycle_Layer(i) = N_{i-1} + 10 − (i == L)×5 + (ceil(N_i/k) − 1)×(N_{i-1} + 6 + max(1, k − N_{i-1})) (11)

where L is the number of layers of the neural network, N_i is the number of neurons in layer i, k is the number of NCUs, and ceil means rounding up to the nearest integer. The equation is consistent with the flowchart in Figure 9a. In the equation, N_{i-1} + 10 − (i == L)×5 is the number of cycles consumed by the stage whose next stage is in a new layer, where N_{i-1} + 10 covers the MAC operation, RAM reading and writing, the activation function calculation, the multiplier pipelines, etc. In the last layer, the end signal of the stage is brought forward to the moment of MAC completion, i.e., 5 cycles earlier. N_{i-1} + 6 + max(1, k − N_{i-1}) is the cycle consumption of a stage whose next stage is not in a new layer, and ceil(N_i/k) − 1 counts these stages. The cycles spent on the whole forward calculation phase are:

F_Cycle = Σ_{i=1}^{L} F_Cycle_Layer(i) (12)

The cycles spent on one layer of the backward calculation phase, B_Cycle_Layer(i), take a similar form to Equation (11) (Equation (13)). Considering the case in Figure 9c, the cycles spent on the whole backward calculation phase are the sum of B_Cycle_Layer(i) over the layers (Equation (14)), where "+1" is used to separate different blocks. The calculation cycles of one layer in the update phase are as follows:

U_Cycle_Layer(i) = floor(N_i/k)×(max(N_{i-1} + 1, k) + 1) + max(N_{i-1} + 1, Remainder) + 1 (15)

Since the end of a stage in the update phase is set at the finish moment of V and S, one stage of the update phase takes max(N_{i-1} + 1, k) + 1 cycles, where "+1" is for reading from RAM, and floor(N_i/k) counts the stages executed with all k NCUs active. In the last stage, the number of active NCUs may be less than k, so the end time is brought forward to max(N_{i-1} + 1, Remainder) + 1, where Remainder is the remainder of N_i/k.
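As a cross-check, the forward-phase cycle count described above can be transcribed into a short Python model. The helper names and the topology-list convention ([N_0, ..., N_L], inputs first) are ours, so the numbers illustrate the model rather than measured hardware:

```python
from math import ceil

def f_cycle_layer(n_prev, n_cur, k, is_last):
    """Forward-phase cycles of one layer: one layer-change stage of
    n_prev + 10 cycles (5 fewer in the last layer), plus
    ceil(n_cur / k) - 1 within-layer stages."""
    layer_change = n_prev + 10 - (5 if is_last else 0)
    within_layer = n_prev + 6 + max(1, k - n_prev)
    return layer_change + (ceil(n_cur / k) - 1) * within_layer

def f_cycle(topology, k):
    """Total forward-phase cycles for a topology such as [10, 50, 1]."""
    pairs = list(zip(topology[:-1], topology[1:]))
    return sum(f_cycle_layer(p, c, k, i == len(pairs) - 1)
               for i, (p, c) in enumerate(pairs))
```

For the 10-50-1 topology with k = 50 NCUs, each layer fits in a single stage and the model gives 20 + 55 = 75 forward cycles; with k = 8, the hidden layer needs seven stages and the count rises to 122 + 55 = 177.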
The cycles spent on the whole update phase are as follows:

U_Cycle = Σ_{i=1}^{L} U_Cycle_Layer(i) + 2 (16)

where "+2" is for reading and writing the weights. The sum of cycles spent on the whole calculation is:

Total_Cycle = F_Cycle + B_Cycle + U_Cycle + 1 (17)

where "+1" is for loading samples. The resource consumption of the proposed method is roughly linear in the number of NCUs and the data bit-width, as can be observed in Section 7. Therefore, the resources can be modeled as:

Resource = C_R × k × B_W (18)

where C_R is a constant coefficient that varies across resource types, and B_W is the data bit-width. The maximum frequency of the proposed method is correlated with the data bit-width and can be approximated by:

MaxF = C_BW × B_W + C_off (19)

where C_BW and C_off are constant coefficients that vary across FPGA types, and MaxF is the maximum frequency.
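The update-phase stage counts described above can be sketched the same way. How a zero remainder is handled (no partial last stage when k divides N_i exactly) is our assumption, since the text leaves it implicit:

```python
def u_cycle_layer(n_prev, n_cur, k):
    """Update-phase cycles of one layer: n_cur // k full stages with all
    k NCUs active, each taking max(n_prev + 1, k) + 1 cycles, plus a
    shorter partial stage when k does not divide n_cur (assumed)."""
    full_stages = (n_cur // k) * (max(n_prev + 1, k) + 1)
    rem = n_cur % k
    partial = (max(n_prev + 1, rem) + 1) if rem else 0
    return full_stages + partial
```

For n_prev = 10 and n_cur = 50, this gives 51 cycles with k = 50 but 84 with k = 8, quantifying the area-delay trade-off discussed below.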
In addition, an index of throughput/resources, which comprehensively reflects resource utilization, is also analyzed for a better comparison. The throughput is the number of samples processed in one second. Therefore, the index of throughput/resource is:

T_R = MaxF / (Total_Cycle × Resource) (20)

where T_R is the throughput/resource. According to the above equations, the resource consumption increases and the maximum frequency decreases as the data bit-width increases, which also results in a low throughput/resource, i.e., low resource utilization. Therefore, a smaller bit-width should be selected, provided that accuracy is ensured. The resource consumption decreases and the calculation cycle count increases as the number of NCUs decreases, which reflects the area-delay trade-off of FPGA design. The index of throughput/resource generally increases as the number of NCUs decreases, with exceptions when the number of NCUs is a divisor of the number of neurons. Therefore, the admissible range of NCU numbers can first be determined by the timing and resource constraints; within that range, the number should be as small as possible, with preference given to divisors of the number of neurons.
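The selection rule can be expressed as a small sweep over candidate NCU counts. The callback-based interface and the toy cost models in the usage note are our own stand-ins for the performance model, not part of the design:

```python
def pick_ncu_count(k_min, k_max, cycles_of, resource_of, max_freq):
    """Return the NCU count in [k_min, k_max] that maximizes the
    throughput/resource index max_freq / (cycles * resource);
    ties go to the smaller count, matching the 'as few as possible' rule."""
    best_k, best_tr = None, -1.0
    for k in range(k_min, k_max + 1):
        tr = max_freq / (cycles_of(k) * resource_of(k))
        if tr > best_tr:
            best_k, best_tr = k, tr
    return best_k
```

With a toy model for a 50-neuron layer, cycles_of = lambda k: ((50 + k - 1) // k) * 60 and resource_of = lambda k: k, the sweep over k = 5..50 returns 5, the smallest divisor of 50 in range, consistent with the guidance above.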

Verification and Comparison
The proposed method was compared to other on-chip learning methods of MLP implemented on FPGA, and the processing speed of the proposed method was also compared across different platforms. Lastly, the proposed method was validated using two practical applications.

Comparison with Other Methods
To begin with, the results obtained from the proposed method were compared to those published in [26]. While the maximum frequency and calculation cycles were not directly provided in [26], they could be inferred from Table 4 and Figure 8 of that reference. The maximum frequency in [26] is around 185 MHz, and the inferred cycle count is a rough estimate rather than an exact value, used only for comparison. In the proposed method, the number of NCUs was configured as the maximum number of neurons in a layer. Using the same topology, an FPGA chip of the Xilinx Virtex5 series, and the ISE synthesis tool, a comparison of the two methods was conducted, and the results are presented in Table 2. From Table 2, it can be observed that the maximum frequency of the proposed method is comparable to that of the method in [26], and the proposed method saves hardware resources significantly: about 77.4% of LUTs, 65.1% of registers, 92.1% of BRAM, and 26.3% of DSPs. At the same time, the calculation cycle count of the proposed method increases by 65.1%. For a more comprehensive comparison, the throughput/resource of the two methods is compared, and the percentage increase of the proposed method is shown in Figure 17. In Figure 17, the proposed method demonstrates a significantly higher throughput/resource index than the method described in [26].
Specifically, the average throughput/LUT increases by 235.2%, throughput/register by approximately 114.0%, and throughput/BRAM by about 1816.7%, while throughput/DSP decreases slightly, by 3.0%. These results indicate that the proposed method achieves much higher resource utilization. Furthermore, the hardware resource consumption can be reduced further, while simultaneously increasing resource utilization, by decreasing the number of NCUs. Taking the topology 10-50-1 from Table 2 as an example, the performance indices for different numbers of NCUs are listed in Table 3. The implementation is still conducted on an FPGA chip of the Xilinx Virtex5 series, as for the method in [26].
Table 3 shows that as the number of NCUs increases, the resource consumption increases and the cycle count decreases, while the maximum frequency changes little. The resource consumption for the different numbers of NCUs in Tables 2 and 3 and the corresponding fitting curves are shown in Figure 18a. For a better display, the LUT and register resources are divided by 100.
In Figure 18a, most points fit the linear curves, indicating a linear relationship between the resources and the configured NCU number. Only the RAM resource of topology 10-50-1 with 50 NCUs deviates; the reason is that the number of weights in each RAM is small, so the weight RAMs are all mapped to distributed RAM instead of Block RAM.
The calculation cycles of topology 10-50-1 for varying NCU numbers are presented as the circles in Figure 18b. To verify Equation (17), the curve of inferred cycles is also plotted in Figure 18b. The figure shows that the inferred cycles are consistent with the actually executed cycles, which validates Equation (17). Note that the calculation cycles do not always decrease as the NCU number grows; for example, the cycle count grows slightly as the NCU number increases from 25 to 49. Therefore, it is advisable to confirm the calculation cycles with Equation (17) during the parameter selection process.
The topology 10-50-1 is again taken as an example to analyze the relationship between throughput/resources and the NCU number, as shown in Figure 19. For a better display, the values of throughput/LUT and throughput/register are multiplied by 1000. The actual throughput/resources from Table 3 for varying NCU numbers are presented as dots in Figure 19, while the lines are the throughput/resources inferred by Equation (20). Equation (20) is validated, as the variation trend of the curves is consistent with that of the dots. The throughput/resource values generally increase as the number of NCUs decreases, except in some special cases, such as NCU numbers of 9 and 10. Therefore, it is advisable to simulate the throughput/resources before the design. In addition, as shown in Figure 17, the throughput/resource index of our method is higher than that of the method in [26] for every topology in Table 2 except 10-50-1. Therefore, the throughput/resource index of topology 10-50-1 in [26] is also presented in Figure 19 as the triangle. It can be observed that the throughput/resource index of our method easily exceeds that of the method in [26] once the number of NCUs decreases below a certain value.
In addition, the performance of the proposed method is also influenced by the data bit-width. Taking the topology 10-6-3-2 from Table 2 as an example, the performance indices for different data bit-widths are shown in Table 4. The bit-width is given in the form (B_S, B_I, B_F), where B_S is the sign bit-width for supporting negative numbers, and B_I and B_F are the integer and fractional bit-widths. The implementation is still conducted on the Xilinx Virtex5 FPGA. The resource consumption for the different bit-widths in Table 4 and the corresponding fitting curves are shown in Figure 20a, which is consistent with Equation (18). The maximum frequencies for the different bit-widths and the values inferred from Equation (19) are shown in Figure 20b, which verifies Equation (19).
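The (B_S, B_I, B_F) fixed-point format can be sketched as follows. The rounding mode (truncation toward zero) and saturation behavior are assumptions for illustration, since the paper does not specify them:

```python
def quantize(x, bi, bf):
    """Quantize x to a signed fixed-point value with `bi` integer bits,
    `bf` fractional bits, and one sign bit (B_S = 1). Truncates toward
    zero and saturates at the representable range; both policies are
    assumptions for illustration."""
    scale = 1 << bf
    q = int(x * scale) if x >= 0 else -int(-x * scale)  # truncate toward zero
    hi = (1 << (bi + bf)) - 1   # largest positive code
    lo = -(1 << (bi + bf))      # most negative code
    q = max(lo, min(hi, q))
    return q / scale
```

With the (1, 7, 16) format used in the Iris experiment, values are representable in roughly [-128, 128) with a resolution of 2^-16, which bounds the truncation error per operation.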
For a more comprehensive comparison, the proposed method is compared with the traditional layer multiplexing method, which is also the work of Ortega-Zamorano et al. [13]. In [13], the resource consumption of one neuron is provided instead of the resource consumption of a concrete topology.
The type of FPGA used is also the Virtex5 series, and the bit-width is B_I = B_F = 16. The comparison of the resource consumption of one neuron is shown in Table 5. According to Table 5, the per-neuron resource consumption of the two methods is equivalent [13], except for the Block RAM; the reason is that the motivation for RAM usage differs between the two methods, as the RAM resource is designed to be fully used in [13], whereas we try to save RAM as much as possible.
The calculation cycles of the topologies in Table 2 can be inferred from the equation provided in [13], and the number of NCUs in the proposed method is configured as the maximum number of neurons in a layer. The cycle comparison of the two methods is shown in Table 6. Compared to the method in [13], the proposed method consumes an average of 21.3% fewer calculation cycles due to the compact timing arrangement. The maximum frequency of the proposed method is about 160 MHz, while that of the method in [13] is 200 MHz, so the maximum frequency of the proposed method is 20% lower. Therefore, it can be inferred that the two methods have an equivalent throughput/resource index, reflecting equivalent resource utilization and indicating that the layer multiplexing algorithm can be seen as a special case of the proposed method. However, the proposed method is more flexible, as the number of NCUs is configurable: the resource consumption can be decreased further, and the resource utilization increased further, as the number of NCUs decreases.
Further, the proposed method is compared with the methods in [27,30,38]. The comparison results are shown in Tables 7-9, where the number in brackets in the "Type" column indicates the NCU number used in the proposed method. Compared with the methods in [27,30,38], our method may be inferior in terms of calculation speed. However, it has a significant advantage in resource consumption, which becomes more pronounced as the number of NCUs decreases. Therefore, the overall resource utilization of the proposed method is higher, as the resource savings outweigh the lag in calculation speed. Additionally, the power consumption of the proposed method is lower than that of the other methods. All of the above comparisons are summarized in Table 10. In summary, compared with other methods, the proposed method exhibits similar or slightly inferior maximum frequency and cycle consumption, but offers a significant advantage in hardware resource consumption and resource utilization. This advantage can be amplified further by decreasing the number of configured NCUs. Moreover, the flexibility of the proposed method makes it more convenient for broader applications.

Comparison of Calculation Speed in Different Platforms
Taking the topology 2-3-4-2 as an example, the learning time of one sample with the proposed method implemented on FPGA is compared with implementations on the Matlab platform and a DSP chip. The result is shown in Figure 21, whose abscissa is logarithmically scaled for a better display. The result in Figure 21 shows that the neural network implemented on the DSP chip is already much faster than the Matlab implementation. However, the proposed method implemented on FPGA is about a hundred times faster still than the DSP implementation, which demonstrates the superiority of the FPGA implementation.

Verification
The proposed method is verified with two practical applications, both implemented on the Xilinx Artix7 xc7a35t FPGA development board. The verification results are compared with those obtained in Matlab; the latter are more precise due to the use of double-precision floating-point data.

Iris Set Classification
The classification of the well-known Iris set is used to verify the proposed method [13]. The dataset is divided into training, validation, and generalization sets in proportions of 50%, 20%, and 30%. The neural network topology is 4-5-3, and the data bit-width of the proposed method is selected as (1, 7, 16). Given the same initial weight values, the training and validation error curves are compared in Figure 22a, and Figure 22b is a partial enlargement of Figure 22a.

In Figure 22, the training and validation error curves of the proposed method implemented on FPGA are almost the same as the error curves of the neural network trained in Matlab. After training, the test dataset is used to verify the training result. Only one classification error among the 45 test samples occurs on both platforms, indicating an identification accuracy of 97.7% and verifying the correctness of the proposed method.
Since increasing the number of NCUs reduces the training cycles per epoch, convergence can be accelerated and the total training time shortened by increasing the number of NCUs. With the maximum absolute training error set to 0.1, the training error curves for different numbers of NCUs are shown in Figure 23.

Flux Training
The proposed method is expected to be further used in practical control applications, such as motor drives. Therefore, a flux training example is used to verify the proposed method. Data of flux, voltage, and current from a motor drive system are used as the learning dataset, in which the voltage and current serve as the inputs and the flux as the output. The dataset has more than 1000 samples, which are divided into training, validation, and generalization sets in proportions of 50%, 20%, and 30%. The neural network topology is 8-5-5-2 [32], and the data bit-width of the proposed method is selected as (1, 7, 18). There are eight inputs because, in addition to the voltage and current in the α and β axes of the present cycle, the values of the previous cycle are also applied as inputs. Given the same initial weight values, the training and validation error curves of the proposed method implemented on FPGA and the curves of the neural network trained in Matlab are compared in Figure 24a, and Figure 24b is a partial enlargement of Figure 24a. In Figure 24, the training and validation error curves of the two platforms are about the same, with a minor difference in the small-error region. The difference is caused by the large number of training samples and the truncation error of the fixed-point data used in FPGA. After training, the results are tested with the test dataset. The curves of the ideal flux, the flux calculated by Matlab, and the flux calculated by the proposed method implemented on FPGA are shown in Figure 25.
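The eight-input arrangement described above can be sketched as follows. The per-cycle sample layout (uα, uβ, iα, iβ) and the ordering of present and previous cycle values are assumptions for illustration, not the paper's exact data format:

```python
def build_input(samples, t):
    """Build the 8-dimensional network input at cycle t from a list of
    per-cycle measurements (u_alpha, u_beta, i_alpha, i_beta): the
    present cycle's values followed by the previous cycle's (assumed
    ordering)."""
    ua, ub, ia, ib = samples[t]
    ua_p, ub_p, ia_p, ib_p = samples[t - 1]
    return [ua, ub, ia, ib, ua_p, ub_p, ia_p, ib_p]
```

Including the previous cycle gives the network a one-step history of the electrical quantities, which is what lets a static MLP approximate the dynamic flux relationship.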

In Figure 25, the flux calculated by the proposed method shows little deviation from the ideal value, which verifies the effectiveness of the proposed method.

Conclusions
An on-chip learning method for MLP implemented on embedded FPGA is introduced in this paper. The method proposed in this study is highly parameterized, allowing for easy adaptation to different applications. It is based on the utilization of the neuron multiplexing technique, which effectively decouples resource consumption from the specific topology of the neural network. As a result, the proposed method demonstrates greater applicability across a wide range of scenarios. To support the proposed method and further reduce RAM block usage, a novel weight segmentation and recombination method is presented, along with a detailed introduction of the weight access order. Additionally, a performance model is developed to evaluate the performance index of the proposed method and facilitate the parameter selection process. Furthermore, a comprehensive comparison is conducted between the proposed method and the alternative approaches, encompassing resource usage, calculation cycles, maximum frequency, and other relevant factors. The results obtained from the comparison reveal that the proposed method significantly reduces hardware resource requirements while maintaining equivalent performance in terms of maximum frequency. While the calculation cycle may be slightly longer, the overall resource utilization is higher. Moreover, as the number of NCUs decreases in the configuration, the advantages of the proposed method in terms of resource consumption or utilization become even more evident. Furthermore, it is noteworthy that the processing speed of the proposed method implemented on FPGA is at least a hundred times faster than that achievable with DSP or Matlab. Verification results obtained from the application of Iris set classification and flux training demonstrate that the error curve for the FPGA implementation closely resembles that observed during Matlab training. Furthermore, the output accuracy of the proposed method meets the requirements for practical applications.
In Figure 25, the flux calculated by the proposed method deviates only slightly from the ideal value, which verifies the effectiveness of the proposed method.

Conclusions
An on-chip learning method for MLP implemented on embedded FPGA is introduced in this paper. The method is highly parameterized, allowing easy adaptation to different applications. It rests on the neuron multiplexing technique, which decouples resource consumption from the specific topology of the neural network, so the method applies to a wide range of scenarios. To support the method and further reduce RAM block usage, a novel weight segmentation and recombination method is presented, together with a detailed description of the weight access order. A performance model is also developed to evaluate the performance indices of the proposed method and to guide parameter selection. A comprehensive comparison with alternative approaches, covering resource usage, calculation cycles, maximum frequency, and other relevant factors, shows that the proposed method significantly reduces hardware resource requirements while achieving an equivalent maximum frequency. Although the calculation cycle may be slightly longer, the overall resource utilization is higher, and the advantages in resource consumption and utilization become more pronounced as the number of NCUs in the configuration decreases. Notably, the processing speed of the proposed method on FPGA is at least a hundred times that achievable with DSP or Matlab. Verification on Iris set classification and flux training shows that the error curve of the FPGA implementation closely matches that observed during Matlab training, and the output accuracy of the proposed method meets the requirements of practical applications.
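To make the multiplexing idea concrete, the following is a minimal software sketch (not the paper's RTL; function and variable names are illustrative) of how a layer of arbitrary width can be evaluated on a fixed pool of NCUs: neurons are processed in rounds of `num_ncus`, so hardware cost tracks the NCU count rather than the layer size, while only the number of rounds depends on the topology.

```python
def layer_forward(inputs, weights, biases, num_ncus):
    """Compute one MLP layer by time-multiplexing `num_ncus` neuron
    calculation units (NCUs) over all neurons of the layer.

    `weights[j]` is the weight vector of neuron j; neurons are handled in
    batches of `num_ncus`, mimicking one multiplexing round in hardware.
    """
    n_neurons = len(weights)
    outputs = [0.0] * n_neurons
    # Each pass of the outer loop corresponds to one multiplexing round.
    for start in range(0, n_neurons, num_ncus):
        for ncu in range(num_ncus):  # these units operate in parallel on FPGA
            j = start + ncu
            if j >= n_neurons:
                break  # last round may leave some NCUs idle
            acc = biases[j]
            for x, w in zip(inputs, weights[j]):
                acc += x * w  # multiply-accumulate inside one NCU
            outputs[j] = acc  # activation function omitted for brevity
    return outputs
```

The result is identical for any `num_ncus`; only the number of rounds (and hence latency) changes, which is the parallelism/resource trade-off the configurable NCU count exposes.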
Author Contributions: Conceptualization, software, validation, investigation, and writing-original draft preparation, Z.Z.; writing-review and editing, Z.Z., K.W., B.G. and G.C.; supervision, G.W. and K.W. All authors have read and agreed to the published version of the manuscript.