Article

Efficient On-Chip Learning of Multi-Layer Perceptron Based on Neuron Multiplexing Method

National Key Laboratory of Electromagnetic Energy, Naval University of Engineering, Wuhan 430030, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(17), 3607; https://doi.org/10.3390/electronics12173607
Submission received: 5 July 2023 / Revised: 21 August 2023 / Accepted: 23 August 2023 / Published: 26 August 2023
(This article belongs to the Topic Artificial Intelligence Models, Tools and Applications)

Abstract

An efficient on-chip learning method based on neuron multiplexing is proposed in this paper to address the limitations of traditional on-chip learning methods, including low resource utilization and non-tunable parallelism. The proposed method utilizes a configurable neuron calculation unit (NCU) to calculate neural networks with different degrees of parallelism by multiplexing NCUs at different levels, and resource utilization can be increased by reducing the number of NCUs, since the resource consumption is predominantly determined by the number of NCUs and the data bit-width, which are decoupled from the specific topology. To better support the proposed method and minimize RAM block usage, a weight segmentation and recombination method is introduced, accompanied by a detailed explanation of the access order. Moreover, a performance model is developed to facilitate the parameter selection process. Experimental results obtained on an FPGA development board demonstrate that the proposed method achieves lower resource consumption, higher resource utilization, and greater generality compared to other methods.

1. Introduction

The artificial neural network has wide applications in image recognition, prediction, control engineering, and other fields due to its advantages, such as strong nonlinear mapping ability, model independence, adaptability, and learning capability [1]. There are various forms of neural networks, with the convolutional neural network receiving significant attention in recent years, along with the Transformer-based architecture following the rise of ChatGPT. However, in 2021, Google’s MLP-Mixer rekindled interest in the multi-layer perceptron (MLP) [2]. The performance of MLP-Mixer rivals that of convolutional or Transformer-based architectures, indicating the untapped potential of MLP [3,4].
Furthermore, MLP remains the most commonly utilized neural network in practical engineering applications, particularly in the field of intelligent control. Its applications encompass modeling, decoupling, parameter estimation, fitting, diagnosis [5,6,7,8], etc. When it comes to implementation platforms, embedded systems are often selected, yet executing code sequentially on chips such as digital signal processors (DSPs) can be time-consuming. In contrast, field-programmable gate arrays (FPGAs) are better suited for neural network implementation due to their parallel structure, which accelerates the calculation process [9]. Consequently, FPGA has emerged as a popular platform for MLP implementation [10,11,12].
Nonetheless, the computation of neural networks implemented on FPGAs can exhaust resources, especially during the learning process, and the limited resources provided by FPGAs demand high resource utilization from the implementation method [13]. Moreover, the topology of neural networks varies across different applications. Since there is no universal formula for topology selection, the trial-and-error method is still the most commonly used strategy [14]. Consequently, the implementation method should possess generality across different topologies and resource budgets. Additionally, the resource consumption of common implementation methods is usually tied to the concrete topology, which constrains their application range, especially when neural networks with large topologies must be implemented with few FPGA resources. Several scalable architectures have been proposed, such as DLAU [15], DeepX [16], FP-DNN [17], etc. However, these accelerators map all operations from different network layers to the same hardware unit, resulting in the same degree of calculation parallelism for different neural networks. Therefore, these architectures can neither meet the various parallelism requirements of different neural networks nor make full use of the reconfigurable characteristics of FPGAs [18,19], which causes inadequate utilization of hardware resources, especially for shallow neural networks.
Therefore, to address these challenges, this paper proposes an efficient on-chip learning method for MLP. Instead of solely performing off-chip learning for inference calculation [20], or solely acting as a computer assistant to accelerate part of the neural network calculation [21], the proposed method deploys the entire learning process of the neural network on a single FPGA, which is better suited to embedded systems. The main contributions of this work are outlined below.
Firstly, an architecture for neural network hardware implementation with configurable neuron calculation units (NCUs) is proposed, and the parallelism-tunable architecture can meet the different parallelism requirements of different applications and increase the resource utilization rate. The proposed method is also highly parameterized to better adapt to different applications, allowing flexible configuration of the topology, data bit-width, and other parameters.
Secondly, a weight adaptive adjustment strategy is proposed to better support the calculation of the proposed method. The weights are segmented and recombined according to the number of NCUs, thereby reducing the utilization of random-access memory (RAM) blocks.
Thirdly, multiple strategies are used to reduce resource consumption and calculation time. The proposed method adopts a “folded” systolic array wherein data is received serially while the NCUs operate in parallel [22]. With this structure, only one memory is needed to store the output or sensitivity, and only one activation function module is used throughout the neural network, resulting in a further reduction of resource consumption. Meanwhile, the end time of each calculation stage is optimized to save computation cycles, ensuring fast calculation through a compact time sequence arrangement.
Finally, a performance model of the proposed method is constructed, which can evaluate the effect of different parameters beforehand to accelerate the parameter selection process.
The rest of this paper is organized as follows. Section 2 provides an introduction to the related works. The learning principle of neural networks is presented in Section 3. Section 4 extensively elaborates on the design of the proposed method. The strategy of weight storage and access is explained in Section 5. In Section 6, the performance model of the proposed method is described. The verification and comparison results of the proposed method are shown in Section 7. We then conclude our work in Section 8.

2. Related Works

Numerous studies have been conducted on the implementation of MLP on FPGAs. In 1999, Izeboudjen proposed the parametric programming of neural networks using VHDL code [23]. The structure of that method was parallel at the neuron level and serial at the layer level, allowing different network topologies to be realized easily by modifying parameters. However, this work only supports the inference calculation of the neural network. In 2007, Izeboudjen extended the idea of parameterization to online training of neural networks [24], which adjusted the topology by copying or deleting neural cores. Nevertheless, the number of layers in this work is not adjustable and the configuration steps are complex. To overcome these limitations, a more flexible FPGA implementation method for neural networks was proposed in [25]. This highly parameterized method made hardware configuration as flexible as software execution while maintaining performance comparable to other hardware implementations, although its resource consumption is high. Another method proposed in [26] combined the input and first hidden layer units in a single module, resulting in significantly reduced resource consumption compared to the method in [25]. Moreover, [27] introduced a more user-friendly tool, allowing fast prototyping of neural network implementations through an intuitive graphical user interface (GUI). However, both methods still consumed a large number of DSP slices and RAM resources.
To enhance the calculation of neural networks and leverage the parallel capabilities of FPGAs, neural networks are usually implemented in parallel at the neuron level [28], while two strategies are commonly used at the layer level: pipelining and multiplexing. The pipelined structure proposed in [29] facilitated online learning of the neural network on FPGA by executing inference and learning in a pipelined manner to accelerate the training process. In [30], a time-delay backpropagation algorithm was proposed, which supported simultaneous forward and backward computation and avoided the problem of backward locking. Despite the advantage of fast calculation, the pipelining strategy is resource intensive. Additionally, because it is difficult to parameterize, the strategy struggles to accommodate different topologies.
On the contrary, the multiplexing strategy is superior in terms of resource saving, although its calculation speed is relatively low. In [31], the multipliers are shared through a time-division method, significantly reducing hardware complexity while also improving calculation speed. The initial application of the layer multiplexing method to neural network calculation can be traced back to [32], where the complete neural network calculation was achieved by continuously multiplexing the resources of a single layer. As a result, a significant reduction in resource consumption, particularly for multi-layer neural networks, was achieved. The idea of multiplexing was applied to a more general computing architecture for feedforward neural networks in [33]. This architecture proved to be applicable to various neural network models, such as the multilayer perceptron, autoencoder, and logistic regression. An improved systolic array technique was used in [22], which multiplexed the process units in sequence. The method reduces the number of weight streams and is easier to scale. However, it must be noted that the aforementioned methods primarily focused on the inference aspect of the neural network and did not encompass the learning process. It was not until [13] that the layer multiplexing method was first employed in the FPGA implementation of deep back-propagation (BP) learning. This approach achieved a considerable reduction in hardware resource consumption during the learning process.
Recognizing both the advantages and limitations of pipelining and multiplexing, an integrated architecture combining the two strategies was presented in [34]. This architecture provides flexibility in selecting between processing speed and resource consumption. Nonetheless, it exhibited limited adaptability to real-time calculations due to irregular time intervals between input samples. In [35], two different architectures were proposed, namely, the N-Fold architecture and the Flow architecture. The N-Fold architecture is a layer multiplexing method, while the Flow architecture is a layer-parallel architecture. However, the two architectures are separate and the parallelism between them is non-tunable. The approach outlined in [36] divided the hidden layers into fixed layers (pipelined) and flexible layers (dual multiplexing) in order to conserve resources. Nevertheless, the flexibility of this approach was constrained, as the flexible layers were restricted to dual usage. Although these hybrid methods strike a balance between resource consumption and calculation speed, they focus solely on network inference, and their parameterization is still difficult.
Considering that the topology and the supplied resources vary across different applications, the multiplexing method is preferable due to its advantage of resource-saving and the convenience it offers for parameterization. Moreover, the calculation speed achieved by this method is sufficient to meet the requirements as long as the time sequences are properly arranged. In addition, by performing calculations on one layer at a time, the multiplexing method effectively avoids the issue of backward locking. Consequently, the multiplexing method is selected as the foundation of our research.

3. The Principle of the MLP

The BP algorithm, which is frequently utilized for training MLP [13], is a gradient-descent-based learning procedure that approximates the expected output by continuously adjusting the weight values according to the error gradient. The calculation process of MLP with the BP learning algorithm can be divided into three phases: the forward calculation phase, the backward calculation phase, and the update phase. The forward calculation phase involves the forward inference calculation, where the input signal is propagated through the network in the forward direction until it reaches the output. The backward calculation phase is responsible for the error backpropagation calculation, during which the error signal propagates through the network in the backward direction until it reaches the input. The update phase focuses on the weight update calculation, where the weights are adjusted layer by layer along the forward direction of the network. Figure 1 depicts the structure of the calculations in one layer of the MLP.
In Figure 1, $\hat{V}_1 \sim \hat{V}_m$ represent the inputs, which are also the outputs of the previous layer. $W_{ij}$ represents the weight from neuron i in the previous layer to neuron j in the current layer. $b_1 \sim b_n$ represent the biases. $u_1 \sim u_n$ represent the outputs of the multiply-accumulate (MAC) operation, which can be obtained as follows:
$$ u_j = \sum_{i=1}^{m} W_{ij}\,\hat{V}_i + b_j \qquad (1) $$
The output of MAC is the input of the activation function g, where g is commonly chosen as the hyperbolic tangent function:
$$ V_j = g(u_j) = \left(e^{u_j} - e^{-u_j}\right)/\left(e^{u_j} + e^{-u_j}\right) \qquad (2) $$
The structure of the output layer is slightly different from Figure 1. The final output is simply a linear function of the MAC output:
$$ y_j = C\,u_j \qquad (3) $$
where C = 1 in this paper.
Assuming the expectation of the network is d, the error is as follows:
$$ e_j = d_j - y_j \qquad (4) $$
In the learning phase, the algorithm adjusts the weights to minimize the mean square error, which is given by:
$$ E = \frac{1}{2}\mathbf{e}^{T}\mathbf{e} = \frac{1}{2}\sum_{i=1}^{n} e_i^{2} \qquad (5) $$
The BP algorithm adjusts the weights with a certain step along the opposite direction of the error gradient:
$$ \Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = \eta\, S_j\, \hat{V}_i \qquad (6) $$
where η is the adjustment step and S is the sensitivity, which can be obtained by the chain rule:
$$ S_i = g'(u_i) \sum_{j=1}^{n} W_{ij}\, \hat{S}_j \qquad (7) $$
where $\hat{S}$ represents the sensitivity value of the previous layer (i.e., the layer processed previously in the backward direction) and g′ is the derivative of the activation function, which can be obtained as follows:
$$ g'(u_j) = 1 - V_j^{2} \qquad (8) $$
The sensitivity value in the last layer is:
$$ S_j = -\frac{\partial E}{\partial e_j}\frac{\partial e_j}{\partial u_j} = C\, e_j \qquad (9) $$
The update of the weight is shown as follows:
$$ W_{ij} = W_{ij} + \Delta W_{ij} \qquad (10) $$
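To make the three phases concrete, a minimal NumPy sketch of one training iteration following Equations (1)-(10) is given below. The function and variable names are ours, and the sketch ignores the fixed-point arithmetic and the hardware scheduling discussed in the following sections.

    import numpy as np

    def train_step(weights, biases, x, d, eta=0.05):
        """One BP iteration per Equations (1)-(10); hidden layers use tanh,
        the output layer is linear with C = 1. weights[l] connects layer l to
        layer l + 1 (the input layer is l = 0), so entry [i, j] is W_ij."""
        # Forward calculation phase, Equations (1)-(3)
        outputs = [x]                                        # V of each layer
        for l, (W, b) in enumerate(zip(weights, biases)):
            u = W.T @ outputs[-1] + b                        # Equation (1): MAC
            v = u if l == len(weights) - 1 else np.tanh(u)   # Equations (3)/(2)
            outputs.append(v)
        e = d - outputs[-1]                                  # Equation (4)

        # Backward calculation phase, Equations (7)-(9)
        S = e                                                # Equation (9) with C = 1
        sens = [S]
        for l in range(len(weights) - 1, 0, -1):
            g_prime = 1.0 - outputs[l] ** 2                  # Equation (8)
            S = g_prime * (weights[l] @ S)                   # Equation (7)
            sens.insert(0, S)

        # Update phase, Equations (6) and (10)
        for l, (W, b) in enumerate(zip(weights, biases)):
            W += eta * np.outer(outputs[l], sens[l])         # Equation (6)
            b += eta * sens[l]
        return 0.5 * float(e @ e)                            # Equation (5)

Here weights and biases are lists of NumPy arrays, one pair per layer, and each call processes a single sample, mirroring the per-sample learning flow implemented on chip.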

4. The Proposed Neuron Multiplexing Method

4.1. Overview of the Proposed Method

The proposed neuron multiplexing method is an improvement of the layer multiplexing method, and the diagram of the traditional layer multiplexing method is shown in Figure 2a.
In the traditional layer multiplexing method, the layer with the maximum number of neurons is selected as the multiplexed layer, so the resource consumption is still related to the concrete topology and the applicability of the method is limited. In addition, the layer multiplexing method has low resource utilization when the number of neurons varies greatly across layers. For example, in Figure 3, a neural network with a topology of 3-4-2 is calculated according to the traditional multiplexing method; the part of the neural network that is executing calculations is drawn with solid lines, while the rest is drawn with shaded lines. The layer number in the figure indicates the layer that is executing calculations. The calculation resources are configured according to the second layer, which has the largest neuron count of four, and it can be observed that two NCUs are idle during the calculation of the last layer, which causes the low resource utilization.
To address these issues, this paper proposes a neuron multiplexing method based on the layer multiplexing method, with the goal of achieving better resource utilization and generality. The proposed architecture is composed of configurable NCUs, which are the resources needed to complete the calculation of one neuron, and the calculation of any topologies can be realized through multiplexing NCUs. Figure 2b provides a diagram illustrating the neuron multiplexing method.
Upon observing Figure 2, it can be noted that the layer multiplexing algorithm can be viewed as a special case of the proposed method, wherein the number of NCUs is set to the maximum number of neurons in a layer. If we consider layer multiplexing as the folding of the neural network in the horizontal dimension, the proposed neuron multiplexing method can be seen as a further folding of the neural network in the vertical dimension.
By employing the proposed method, it is possible to decouple the topology from the resources and enhance their utilization. For example, if the case depicted in Figure 3 is processed using the proposed method and the number of NCUs is set to two, the calculation process for this is demonstrated in Figure 4. The stage number in the figure represents the different executing stage of one layer of the proposed method.
In Figure 4, the number of NCUs is smaller than four, resulting in less resource consumption, and the NCUs are occupied all the time, which increases resource utilization. Since the calculation of a layer with more neurons than the configured number of NCUs is divided into several stages, the number of calculation cycles also increases slightly. A detailed analysis of the performance is given in Section 6.
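As a quick illustration of this stage division, the sketch below (with hypothetical names) lists how many NCUs are active in each stage of every layer; it reproduces the 3-4-2 example of Figure 4 when k = 2.

    import math

    def schedule_stages(topology, k):
        """For each layer after the input layer, return the number of active
        NCUs in every multiplexing stage; a layer with N_i neurons needs
        ceil(N_i / k) stages."""
        plan = []
        for n_i in topology[1:]:
            stages = [min(k, n_i - s * k) for s in range(math.ceil(n_i / k))]
            plan.append(stages)
        return plan

    # The 3-4-2 network of Figure 3 with k = 2 NCUs:
    # layer 2 -> [2, 2] (two stages), layer 3 -> [2] (one stage)
    print(schedule_stages([3, 4, 2], 2))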
The subsequent subsections introduce the detailed implementation design of the proposed method. To keep the introduction clear, we denote the number of neurons in the previous layer as Ni−1, the number of neurons in the current layer as Ni, and the number of NCUs as k.

4.2. The Top-Level Module

The top-level module mainly contains the interfaces with the outside, which is shown in Figure 5.
At the top-level module, the samples, initial weights, and expectations are inputs from external sources, which can be acquired by retrieving the corresponding tables. The inner modules generate the addresses of these tables, which are then output from the top-level module. The “TrainFlag” signal determines whether to execute the inference-only calculation or the entire training process. This determination is manifested through the state machine depicted within the dashed rectangle: when the signal on an arrow is valid, the state machine performs the corresponding transition. If “TrainFlag” is zero, the state machine transitions directly to “State = 1” upon completion of “State = 2”; otherwise, it proceeds to “State = 3” and carries out the entire calculation. (Refer to Table 1 for a description of the state machine.)
The “NewCycleStart” signal is the start signal of the new cycle, which is determined by the state machine and the enable signal from outside. “Out” and “Error” signals are the output and error after inference is completed.

4.3. Multipliers and Weight Storage

To make the design clearer, the forward calculation phase, backward calculation phase, and update phase correspond to three independent modules, namely, forward calculation block, backward calculation block, and update block. In all three phases, multipliers and weights are utilized and multiplexed to minimize resource usage. The interconnections between the multipliers, weight storage, and other blocks can be found in Figure 6.
In Figure 6, thick lines represent signal clusters composed of multiple signals. Considering that the multipliers and weights serve as public resources with numerous connections to other modules, they are not designed as separate modules and the connections are not present in the figure but are represented by dashed lines instead. All signal names starting with “F”, “B”, and “U” indicate signals originating from the forward calculation block, backward calculation block, and update block, respectively.
The inputs of the multipliers switch based on the state signal, enabling multiplexing of the multipliers during different phases. With the exception of the multiplication within the activation function module and the constant multiplication within the update module, all other multiplications are accomplished using the multipliers depicted in Figure 6. As a result, the total number of multipliers utilized throughout the algorithm is only two more than the number of NCUs.
The output of the multipliers is truncated to prevent excessive resource consumption in accordance with the configured data bit-width. To reduce the critical path delay, two cycles of delay are introduced after the output. Additionally, the multipliers offer flexible configuration options, allowing them to be mapped to either DSP slices or LUT slices of the FPGA, with the former being the default mapping choice.
The “Weight Rd&Wr” module consists of several dual-port RAMs to store the weights, which support simultaneous reading and writing. The module is automatically mapped to distributed RAM or block RAM according to the storage size. The read addresses of the RAMs are switched according to the state signal, while the write-related signals are switched according to the end signal of weight initialization, as the weights only need to be written in the weight initialization or update phase. The weight initialization module generates the read- and write-related weight signals during the weight initialization phase.

4.4. Forward Calculation Block

The forward calculation block is responsible for performing the forward inference calculation of the neural network, which corresponds to Equation (1) through to (4). The diagram of this block is depicted in Figure 7.
The start signal of the block derives from the end signal of the previous stage, or from the external start signal of the new cycle for the first stage of the first layer. When the start signal arrives, the status of layers and stages will be updated first. A counter named “Counter1” will start at the same time, which increases from 0 to Ni−1. The address offset of weight and output V can be obtained by looking up the corresponding offset table according to the current status of layers and stages. The address of weight and V can be then obtained by adding the counter and the offset value.
The read address of weight “F_W_RdAddr” is output to Figure 6 after delays to access the weight RAM, and the accessed weight “Wout” is input to the multipliers in Figure 6. The read address of V is input to the RAM of V through a selector, as V is also used in the backward and update phases. The accessed value of V is also input to the multipliers in Figure 6; multiple multiplication results “MulOut” are then obtained in parallel and input to the accumulation module through “In2”. The signal flow of the MAC is shown in Figure 8.
The architecture illustrated in Figure 8 can be regarded as a “folded” systolic array [22], which is similar to a conventional systolic array when its multiply-accumulate (MAC) units are vertically unfolded. The MAC comprises the multipliers depicted in Figure 6 and the accumulation module within the forward calculation block. The input traverses the MACs sequentially, so that V can be stored in a single RAM, with only one value accessed per clock cycle.
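A purely behavioral sketch of this “folded” systolic MAC is given below; it assumes that NCU j sees each serial input one clock cycle later than NCU j−1, consistent with the delay scheme discussed in Section 5.2, and abstracts away the RTL pipelining details.

    def folded_systolic_mac(v_hat, bias, W, k):
        """Behavioral model of the folded systolic array: the serial input
        stream v_hat flows through the k MACs one cycle apart, so only one V
        value is read from the RAM per clock cycle while the k NCUs work in
        parallel. W[i][j] is the weight from input i to the neuron of NCU j."""
        acc = [bias[j] for j in range(k)]      # each NCU starts from its bias
        m = len(v_hat)
        for t in range(m + k - 1):             # clock cycles of one stage
            for j in range(k):
                i = t - j                      # input index reaching NCU j at cycle t
                if 0 <= i < m:
                    acc[j] += W[i][j] * v_hat[i]
        return acc                             # u_j of Equation (1) for this stage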
The values of “MacOut” are converted into serials through Selector2 to reduce the usage of the activation function module and facilitate the storage of V. The design of the activation function is our previous work, so it will be only briefly introduced here. The activation function is realized through a hybrid method, the fast-changing region of which is approximated by the method of the lookup table with interpolation [25], and the slow-changing region of which is realized using the range addressable lookup table method [37]. The optimal data bit-width of the method is selected automatically according to the expected accuracy. Compared with other methods, the proposed approximation method can save more hardware resources under the same expected accuracy. Since only one activation function module is used for the entire neural network, its accuracy can be appropriately increased to achieve a more accurate calculation of the neural network.
“Counter2” starts a few cycles after “Counter1” ends; it not only serves as the selection signal of “Selector2” but is also added to the offset address of V to form the write address. The end signal of “Counter1” is also input into the “End Determine” module to determine the end time of both the stages and the forward calculation phase. Intuitively, the calculation of each stage should not end until the storage of V is finished. However, the end time can be advanced because the storage can be executed simultaneously with the calculation of the next stage. The determination of the end time must avoid storage conflicts between different stages, which means the storage of the output in the current stage should be finished before the storage in the next stage starts. Therefore, the end time in the forward calculation phase is determined as shown in Figure 9a.
In Figure 9a, “Counter3” starts when the MAC operation finishes and ends when the stage ends. If the next stage is in a new layer, its inputs are the outputs of the current layer plus an additional bias. Since the number of cycles spent on MAC equals the number of inputs, and the number of cycles needed for storage is one more than the number of outputs in the current stage (one for the bias), the cycles spent on the MAC operation in the next stage are not fewer than the cycles needed for storage in the current stage. Considering that the inputs of the next layer are obtained through RAM access, the first output must be stored before the new stage starts. Therefore, the end time can be defined at the moment the first output is stored in the RAM, i.e., Counter3 = 3.
If the next stage is not in a new layer, its inputs are the outputs of the previous layer plus an additional bias. The number of cycles needed for storage in each stage is at most k + 1, since the number of NCUs is k and the extra one is reserved for the bias. When Ni−1 + 1 ≥ k + 1, the cycles spent on MAC in the next stage are still enough for the current outputs to be stored in RAM. Because the outputs of the current stage are not used in the next stage, the end time can be set at the moment the MAC finishes, i.e., Counter3 = 1. When Ni−1 < k, the cycles spent on MAC in the next stage are not enough for the current outputs to be stored in RAM, hence the end time has to be delayed to Counter3 = k − Ni−1. In this way, the cycles spent on MAC are just enough for the leftover outputs to be stored in RAM, avoiding a storage conflict in the next stage. The “StageEnd” signal in the last stage of the last layer can also be used as the end signal of the forward calculation phase.
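The end-of-stage rule above can be condensed into a few lines; the function below is our own summary of Figure 9a, not code taken from the design.

    def forward_stage_end_count(next_stage_in_new_layer, n_prev, k):
        """Counter3 value at which a forward-calculation stage may end
        (Figure 9a): 3 when the next stage starts a new layer; otherwise 1
        if the next stage's MAC covers the pending stores (N_{i-1} >= k), or
        k - N_{i-1} extra cycles when N_{i-1} < k."""
        if next_stage_in_new_layer:
            return 3
        return 1 if n_prev >= k else k - n_prev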

4.5. Backward Calculation Block

The backward calculation block is used to calculate the sensitivity of the neural network, which corresponds to (7)~(9). The diagram of the backward calculation block is shown in Figure 10.
Since some components in the forward calculation block are similar to those in the backward calculation block, they will not be reintroduced. The “folded” systolic array depicted in Figure 8 is also implemented in the backward calculation block with a different weight access order. This approach effectively avoids simultaneous RAM access. The details of the access order will be explained in Section 5. To accommodate the cross-access of different RAMs required in the backward calculation phase, the “Weight Addr Cal” and “WTrans” modules are necessary to determine the appropriate weight order.
Based on Equations (7) and (8), the multipliers in the backward calculation phase must perform multiple multiplications as opposed to just the multiplication in the MAC operation. Consequently, the inputs of the multipliers are switched according to the inner state machine, as illustrated by the dashed rectangle in Figure 10. Within this figure, “BState” represents the signal of the internal state machine, “MacEnd” indicates the completion of the MAC operation, “DerivativeEnd” signifies the completion of the derivative calculation, and “MulEnd” marks the completion of all multiplications in the current stage. As a result, the multipliers are multiplexed to serve multiple functions, thereby conserving resources.
The end time of each stage in the backward calculation block is determined as shown in Figure 9b, which is similar to that in the forward calculation block. The only difference is that there is no activation function calculation in the backward calculation block, so the storage starts immediately after the completion of the MAC. Therefore, the moment the first sensitivity value is stored in the RAM is also the moment the MAC finishes, i.e., Counter3 = 1.
Unlike in the forward calculation phase, the “StageEnd” signal in the last stage of the last layer cannot be directly used as the end signal of the backward calculation phase. The end time determination of the backward calculation phase is shown in Figure 9c, where “Counter4” starts when the stage ends and stops when the backward calculation phase ends. The “StageEnd” signal in the backward calculation phase coincides with the moment the first sensitivity value is stored in RAM, but the access of V may not have ended at this moment, because the value of V is used to calculate the derivative of the activation function in (8). Therefore, if the backward calculation phase ended at this moment, the accesses of V in the backward calculation and update phases might conflict. Since the “StageEnd” signal is seven clock cycles later than the first access of V, the end signal of the backward calculation phase can be set to the “StageEnd” signal if “ActiveLast” is less than or equal to seven, where “ActiveLast” is the number of neurons active in the calculation in the last stage of the last layer. Otherwise, the end signal of the backward calculation phase should be ActiveLast − 7 cycles later than the “StageEnd” signal.

4.6. Update Block

The update block is mainly used for the update of weight, which corresponds to (6) and (10). The diagram of the update block is shown in Figure 11.
Since the values of V and S are all stored in RAMs and only one value can be accessed in one clock cycle, the value of S is latched in registers after every access to support the parallel multiplication, and the constant multiplication in (6) is performed after the access of S. The signal flow of the multiplication is shown in Figure 12.
In Figure 12, the inputs flow across the multipliers in sequence, executing the parallel multiplication with ηS, which makes full use of the multipliers. The weight adjustments are obtained after the multiplication; their values are added to the original weights to obtain the new values, which are then updated in the storage. Once the input of V and S completes in the current stage, the input of the next stage can be executed immediately, as there is no accumulation or selection operation involved in the update block, unlike in the forward or backward calculation block. As a result, the end of the stage can be determined at the moment the access of V and S is completed. Furthermore, the “StageEnd” signal in the last stage of the final layer can serve as the concluding signal of the update phase.

5. Weight Storage and Access

To better adapt to the proposed architecture and further reduce the usage of the RAM block, a new weight storage and access method is proposed and will be introduced in this section. To make a clear introduction, set m = Ni−1, n = Ni, floor(m/k) = γ, floor(n/k) = λ, where floor means round to the nearest integer less than or equal to the element.

5.1. Weight Storage

The number of RAMs should be sufficient to support simultaneous access of the weights, and the number of weights stored in each RAM should be as large as possible to reduce the number of RAM blocks used. The maximum number of weights accessed simultaneously is k in the proposed method, and the weights are therefore segmented and recombined so as to be stored in k RAMs.
The weight matrices in the forward and backward calculation phases are shown in Figure 13a,b, where the number in each square is the subscript of the variable, and the bias and the corresponding weight are placed in the dashed squares since they are not used in the backward calculation phase. Squares with the same color are stored in the same RAM.
The weights are stored in “RAM_W1~k” according to the access order of the weights in the forward calculation. First, the weights in the first k rows of Figure 13a are stored in “RAM_W1~k” in sequence, and then the weights in the (k + 1)-th row are stored in “RAM_W1” again. The weights in the subsequent rows follow the same storage rule. The storage order within one layer is shown in Figure 13c.
In addition, the weights belonging to the same neuron position in different layers are recombined in the order of the layers. For example, the storage order of weights in the network of topology 3-4-5-2 with three NCUs is shown in Figure 14.
It can be observed in Figure 14 that the start addresses of the weights in different layers vary, hence the address offsets of the different layers are tabulated beforehand. Moreover, the weights are initialized individually following the order of “RAM_W1~k”, which differs from the weight matrix format. Therefore, an m-file script that converts the weight matrix into the proposed form was created, which is convenient for initialization.
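The segmentation and recombination rule can be expressed as a short conversion script, similar in purpose to the m-file mentioned above; the exact matrix orientation and any padding used in the actual design may differ, so this is only a sketch.

    def pack_weights(layer_matrices, k):
        """Distribute the weights of all layers into k RAM images following
        the forward access order: row j of a layer (the weights and bias of
        the j-th neuron of that layer) goes to RAM_W[(j mod k)], and the
        layers are concatenated in order, so that during a stage each NCU
        streams its weights from its own RAM."""
        rams = [[] for _ in range(k)]
        for W_aug in layer_matrices:   # W_aug: N_i rows x (N_{i-1} + 1) columns, bias last
            for j, row in enumerate(W_aug):
                rams[j % k].extend(row)
        return rams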

5.2. Weight Access

Since the weights are stored according to the access order of the forward calculation, the access order in the forward calculation phase is just the sequence of storage with certain cycles delay, in which the delays are used for the flow of the systolic array. The access order is shown in Figure 15a. Conversely, the weight access in the backward calculation phase is relatively complex on account of the cross-access of different RAMs, which is shown in Figure 15b.
In Figure 15b, the first row in stage 1 is W11~W1n: W11 is read from “RAM_W1” at “Clk1”, W12 from “RAM_W2” at “Clk2”, and W1k from “RAM_Wk” at “Clkk”. Then, at “Clkk+1”, W1(k+1) is read from “RAM_W1” again, and the following weights obey the same rule. In the next stage, the access starts from W(k+1)1 and follows the same order as in stage 1, as do the subsequent stages. This access order not only satisfies the simultaneous access of k RAMs but also avoids simultaneous accesses within one RAM. The relevant pseudo-code for obtaining the read address of the weight in the backward calculation phase is shown in Algorithm 1, where “ActiveNum” is the number of NCUs actually executing in the current stage, and “LayerOffset” is the offset of the current layer, which is obtained by looking up the layer offset table.
Algorithm 1: Get Read Address of Weight
Input: Start, LayerOffset, ActiveNum
Param: m, n, k
Output: RdAddr
if Start then
  En ← 1
end
if En then
  if Count == k − 1 then
    if Offset ≥ n then
      Offset ← 0
      Count ← 0
      En ← 0
    else
      Offset ← Offset + m + 1
      Count ← 0
    end
  else
    Count ← Count + 1
  end
end
Addend = min(Count, ActiveNum)
RdAddr = Offset + Addend + LayerOffset
In the pseudo-code, the “Addend” is bounded by min(Count, ActiveNum) due to the possibility that the “ActiveNum” may be less than k in the final stage.
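For completeness, a direct behavioral transcription of Algorithm 1 into Python is given below (our own sketch, not the RTL); Count, Offset, and En are the internal registers implied by the pseudo-code, and step() is evaluated once per clock cycle.

    class BackwardWeightAddrGen:
        """Behavioral model of Algorithm 1: generates the weight read address
        for the backward calculation phase, one address per clock cycle."""

        def __init__(self, m, n, k):
            self.m, self.n, self.k = m, n, k   # m = N_{i-1}, n = N_i, k = NCU number
            self.count = 0                     # Count register
            self.offset = 0                    # Offset register
            self.en = False                    # En register

        def step(self, start, layer_offset, active_num):
            if start:
                self.en = True
            if self.en:
                if self.count == self.k - 1:
                    if self.offset >= self.n:
                        self.offset, self.count, self.en = 0, 0, False
                    else:
                        self.offset += self.m + 1
                        self.count = 0
                else:
                    self.count += 1
            addend = min(self.count, active_num)
            return self.offset + addend + layer_offset   # RdAddr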
Additionally, the delays shown in Figure 15 appear to introduce more calculation cycles. However, the overall calculation cycle count is not increased, since the parallel outputs of the MAC units are converted into a serial data stream anyway. Taking the calculation of sensitivity in the backward calculation phase as an example, the time sequence analysis is shown in Figure 16, where “SCalTime” is the time used by the MAC and Tk is the time when the storage starts.
In Figure 16a, although all MAC operations finish at time Tk, the storage completes at Tk+3, since the store is serial. In Figure 16b, the calculation of Si completes at Ti+k−1, which is i − 1 cycles later than Tk due to the input delay. However, the storage still completes at Tk+3, because each value is stored at the moment its calculation completes. As a result, the delay arrangement does not increase the overall calculation time, at the cost of only a slight increase in register usage.
Through the above strategy of weight storage and access, only k RAM modules are needed, which maximizes support for parallel computing while reducing the usage of RAM blocks.

6. Performance Model

The proposed method is characterized by a highly parameterized design, featuring a configurable topology, data bit-width, and NCU number, thus enabling its wide application. In order to expedite the parameter selection process, a performance model is established for evaluating the proposed method. The performance index encompasses essential metrics, including calculation cycles, resource consumption, and maximum frequency, among others.
The calculation cycles of each layer during the forward calculation phase can be expressed as follows:
$$ \mathrm{F\_Cycle\_Layer}(i) = N_{i-1} + 10 - (i == L)\times 5 + \left(\mathrm{ceil}(N_i/k) - 1\right)\times\left(N_{i-1} + 6 + \max(1,\, k - N_{i-1})\right) \qquad (11) $$
where L is the number of layers of the neural network, and ceil means round to the nearest integer greater than or equal to the element.
The equation is consistent with the flowchart in Figure 9a. In the equation, Ni−1 + 10 − (i == L) × 5 is the number of cycles consumed in a stage whose next stage is in a new layer, where Ni−1 + 10 covers the cycles spent on the MAC, RAM reading and writing, activation function calculation, the multiplier pipelines, etc. In the last layer, the end signal of the stage is brought forward to the moment of MAC completion, i.e., 5 cycles earlier. Ni−1 + 6 + max(1, k − Ni−1) is the cycle consumption of a stage whose next stage is not in a new layer, and ceil(Ni/k) − 1 gives the number of such stages.
The cycles spent on the forward calculation phases are:
$$ \mathrm{F\_Cycle} = \sum_{i=2}^{L} \mathrm{F\_Cycle\_Layer}(i) \qquad (12) $$
The cycles spent on one layer of the backward calculation phase are similar to that in the forward calculation phase:
$$ \mathrm{B\_Cycle\_Layer}(i) = N_{i+1} + 8 + \left(\mathrm{ceil}(N_i/k) - 1\right)\times\left(N_{i+1} + 7 + \max(1,\, k - N_{i+1})\right) \qquad (13) $$
Considering the case in Figure 9c, the cycles spent on the backward calculation phases are as follows:
$$ \mathrm{B\_Cycle} = \sum_{i=2}^{L-1} \mathrm{B\_Cycle\_Layer}(i) + 1 + \max(1,\, \mathrm{BActive\_Last} - 7) \qquad (14) $$
where “+1” is used to separate different blocks. The calculation cycles of one layer in the update phase are as follows:
$$ \mathrm{U\_Cycle\_Layer}(i) = \mathrm{floor}(N_i/k)\times\left(\max(N_{i-1}+1,\, k) + 1\right) + \min(\mathrm{Remainder},\, 1)\times\left(\max(N_{i-1}+1,\, \mathrm{Remainder}) + 1\right) \qquad (15) $$
Since the end of a stage is set at the moment the access of V and S finishes in the update phase, the cycle count of one stage in the update phase is max((Ni−1 + 1), k) + 1, where “+1” accounts for reading from RAM, and floor(Ni/k) gives the number of stages executed with all k NCUs. In the last stage, the number of active NCUs may be less than k, so the end time is brought forward to max((Ni−1 + 1), Remainder) + 1, where Remainder is the remainder of Ni/k.
The cycles spent on the update phases are as follows:
$$ \mathrm{U\_Cycle} = \sum_{i=2}^{L} \mathrm{U\_Cycle\_Layer}(i) + 2 \qquad (16) $$
where “+2” accounts for reading and writing the weights. The total number of cycles spent on the whole calculation is as follows, where “+1” is for loading the samples:
$$ \mathrm{Cycle} = \mathrm{F\_Cycle} + \mathrm{B\_Cycle} + \mathrm{U\_Cycle} + 1 \qquad (17) $$
The resource consumption of the proposed algorithm is roughly linear with the number of NCUs and data bit-width, which can be observed in Section 7. Therefore, the modeling of resources can be as follows:
$$ \mathrm{Resource} \approx C_R \times k \times B_W \qquad (18) $$
where CR is a constant coefficient that varies for different types of resources, and BW is the data bit-width.
The maximum frequency of the proposed method is correlated with data bit-width, which can be approximated by:
$$ \mathrm{MaxF} \approx C_{BW} \times B_W^{\,(C_{off})} \qquad (19) $$
where CBW and Coff are constant coefficients that vary for different types of FPGA, and MaxF is the maximum frequency.
In addition, a throughput/resource index that comprehensively reflects resource utilization is also analyzed for a better comparison. The throughput is the number of samples processed in one second. Therefore, the throughput/resource index is as follows:
$$ \mathrm{T\_R} = \mathrm{MaxF} / \left(\mathrm{Cycle} \times C_R \times k \times B_W\right) \qquad (20) $$
where T_R is throughput/resource.
According to the above equations, the resource consumption increases and the maximum frequency decreases as the data bit-width increases, which also results in a low throughput/resource index, indicating low resource utilization. Therefore, a smaller bit-width should be selected under the premise of ensuring accuracy. The resource consumption decreases and the calculation cycle count increases when the number of NCUs is decreased, which reflects the area-delay trade-off of FPGA designs. The throughput/resource index generally increases as the number of NCUs decreases, except when the number of NCUs is a divisor of the number of neurons. Therefore, the range of the NCU number can first be determined by the timing and resource constraints, and then the number should be as small as possible, preferably a divisor of the number of neurons.
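To speed up parameter sweeps, the cycle model of Equations (11)-(17) and the throughput/resource index of Equation (20) can be scripted as sketched below. This is our own transcription of the equations: topology[0] is the input layer, BActive_Last must be supplied as described in Section 4.5, and MaxF and C_R are treated as externally fitted values.

    import math

    def f_cycle(topology, k):
        """Forward-phase cycles, Equations (11) and (12); layer i of the paper
        is topology[i - 1], so the loop runs over layers 2..L."""
        L = len(topology)
        total = 0
        for i in range(1, L):
            n_prev, n_i = topology[i - 1], topology[i]
            last = 5 if i == L - 1 else 0
            total += (n_prev + 10 - last
                      + (math.ceil(n_i / k) - 1) * (n_prev + 6 + max(1, k - n_prev)))
        return total

    def b_cycle(topology, k, b_active_last):
        """Backward-phase cycles, Equations (13) and (14), over layers 2..L-1."""
        L = len(topology)
        total = 0
        for i in range(1, L - 1):
            n_next, n_i = topology[i + 1], topology[i]
            total += (n_next + 8
                      + (math.ceil(n_i / k) - 1) * (n_next + 7 + max(1, k - n_next)))
        return total + 1 + max(1, b_active_last - 7)

    def u_cycle(topology, k):
        """Update-phase cycles, Equations (15) and (16)."""
        total = 0
        for i in range(1, len(topology)):
            n_prev, n_i, rem = topology[i - 1], topology[i], topology[i] % k
            total += ((n_i // k) * (max(n_prev + 1, k) + 1)
                      + min(rem, 1) * (max(n_prev + 1, rem) + 1))
        return total + 2

    def total_cycle(topology, k, b_active_last):
        """Equation (17): one extra cycle for loading the sample."""
        return (f_cycle(topology, k) + b_cycle(topology, k, b_active_last)
                + u_cycle(topology, k) + 1)

    def throughput_per_resource(max_f, cycle, c_r, k, bw):
        """Equation (20): throughput divided by the resource estimate of (18)."""
        return max_f / (cycle * c_r * k * bw)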

7. Verification and Comparison

The proposed method was compared to other on-chip learning methods of MLP implemented on FPGA, and the processing speed of the proposed method was also compared across different platforms. Lastly, the proposed method was validated using two practical applications.

7.1. Comparison with Other Methods

To begin with, the results obtained with the proposed method were compared to those published in [26]. While the maximum frequency and calculation cycles were not directly provided in [26], they could be inferred from Table 4 and Figure 8 of that reference. The maximum frequency in [26] is around 185 MHz, and the inferred cycles are a rough estimate rather than an exact value, used only for comparison. In the proposed method, the number of NCUs was configured to be the maximum number of neurons in a layer. Using the same topologies, a Xilinx Virtex-5 series FPGA, and the ISE synthesis tool, a comparison of the two methods was conducted, and the results are presented in Table 2.
From Table 2, it can be observed that the maximum frequency of the proposed method is comparable to that of the method in [26], and the proposed method saves hardware resources significantly, with LUTs reduced by about 77.4%, registers by 65.1%, BRAM by 92.1%, and DSP by 26.3%. At the same time, the calculation cycles of the proposed method increase by 65.1%. To make a more comprehensive comparison, the throughput/resource of the two methods is compared, and the percentage increase of the proposed method is shown in Figure 17.
In Figure 17, the proposed method demonstrates a significantly higher throughput/resource index compared to the method described in [26]. Specifically, the average throughput/LUT increases by 235.2%, throughput/register increases by approximately 114.0%, and throughput/BRAM increases by about 1816.7%. However, there is a minor decrease in throughput/DSP of 3.0%. These results indicate that the proposed method exhibits much higher resource utilization. Furthermore, it is worth highlighting that the hardware resource consumption can be further reduced, while simultaneously increasing resource utilization, by decreasing the number of NCUs in the proposed method. Taking the 10-50-1 topology from Table 2 as an example, the performance indices for different numbers of NCUs are listed in Table 3. The implementation is still conducted on a Xilinx Virtex-5 series FPGA, as for the method in [26].
Table 3 shows that as the number of NCUs increases, the resource consumption increases and the cycle count decreases, while the maximum frequency changes little. The resource consumption for the different numbers of NCUs in both Table 2 and Table 3 and the corresponding fitting curves are shown in Figure 18a. For better display, the LUT and register resources are divided by 100.
In Figure 18a, most points fit the linear curves, indicating a linear relationship between the resources and the configured NCU number. Only the RAM resource of topology 10-50-1 with 50 NCUs deviates; the reason is that the number of weights in each RAM is small, so the weight RAMs are all mapped to distributed RAM instead of block RAM.
The calculation cycles of topology 10-50-1 varying with the NCU number are presented as circles in Figure 18b. To verify Equation (17), the curve of the inferred cycles varying with the NCU number is also presented in Figure 18b. The figure shows that the inferred cycles are consistent with the actually executed cycles, which validates Equation (17). The calculation cycles also do not always decrease as the NCU number grows; for example, the calculation cycles grow slightly as the NCU number increases from 25 to 49. Therefore, it is better to confirm the calculation cycles with Equation (17) during the parameter selection process.
The topology 10-50-1 is again taken as an example to analyze the relationship between throughput/resources and the NCU number, as shown in Figure 19. For better display, the values of throughput/LUT and throughput/register are multiplied by 1000.
The actual throughput/resources of Table 3 varying with the NCU number are presented as dots in Figure 19, while the lines in the figure are the throughput/resources inferred by Equation (20). Equation (20) is validated since the variation trend of the curves is consistent with that of the dots. The values of throughput/resources generally increase as the number of NCUs decreases, except for some special cases, such as an NCU number of 9 or 10. Therefore, it is better to simulate the throughput/resources before the design. In addition, as shown in Figure 17, the throughput/resource index of our method is higher than that of the method in [26] for every topology in Table 2 except 10-50-1. Therefore, the throughput/resource index of the 10-50-1 topology in [26] is also presented in Figure 19 as the triangle. It can be observed that the throughput/resource index of our method easily exceeds that of the method in [26] once the number of NCUs decreases to a certain value.
In addition, the performance of the proposed method is also influenced by the data bit-width. Taking the topology 10-6-3-2 in Table 2 as an example, the performance indices for different data bit-widths are shown in Table 4. The bit-width is given in the form (BS, BI, BF), where BS is the sign bit-width for supporting negative numbers, and BI and BF are the integer and fractional bit-widths. The implementation is still conducted on the Xilinx Virtex-5 FPGA.
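As a side note, the (BS, BI, BF) fixed-point format can be illustrated with a small quantization helper; this is our own example, not code from the paper, and it assumes truncation toward negative infinity.

    import math

    def quantize(x, bi, bf):
        """Truncate x to a signed fixed-point number with 1 sign bit, bi
        integer bits, and bf fractional bits, e.g., (1,7,16) as used in
        Section 7.3.1."""
        scale = 1 << bf
        lo, hi = -(1 << (bi + bf)), (1 << (bi + bf)) - 1
        q = max(lo, min(hi, math.floor(x * scale)))   # saturate, then truncate
        return q / scale

    print(quantize(3.14159, bi=7, bf=16))   # truncated to a multiple of 2**-16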
The resource consumption for the different data bit-widths in Table 4 and the corresponding fitting curves are shown in Figure 20a, which is consistent with Equation (18). The maximum frequency for the different data bit-widths and the values inferred from Equation (19) are shown in Figure 20b, which verifies Equation (19).
For a more comprehensive comparison, the proposed method is compared with the traditional layer multiplexing method, i.e., the work of Ortega-Zamorano et al. [13]. In [13], the resource consumption of one neuron is provided instead of the resource consumption of a concrete topology. The FPGA used is also of the Virtex-5 series, and the bit-width is BI = BF = 16. The comparison of the resource consumption of one neuron is shown in Table 5.
According to Table 5, the per-neuron resource consumption of the two methods is equivalent, except for the block RAM; the reason is that the motivation for RAM usage differs between the two methods, as the RAM resource is designed to be fully used in [13], whereas we try to save as much RAM as possible.
The calculation cycles for the topologies in Table 2 can be inferred from the equation provided in [13], and the number of NCUs in the proposed method is configured as the maximum number of neurons in a layer. The cycle comparison of the two methods is shown in Table 6.
In Table 6, compared to the method in [13], the proposed method consumes on average 21.3% fewer calculation cycles due to the compact timing arrangement. The maximum frequency of the proposed method is about 160 MHz, while that of the method in [13] is 200 MHz, so the maximum frequency of the proposed method is 20% lower. Therefore, it can be inferred that the two methods have an equivalent throughput/resource index, which reflects that their resource utilization is also equivalent and confirms that the layer multiplexing algorithm can be seen as a special case of the proposed method. However, the proposed method is more flexible since the number of NCUs is configurable: the resource consumption can be further decreased, and the resource utilization further increased, as the number of NCUs decreases.
Further, the proposed method is compared with the methods in [27,30,38]. The comparison results are shown in Table 7, Table 8 and Table 9. The number in brackets in the “Type” column indicates the number of NCUs used in the proposed method.
Compared with the methods in [27,30,38], our method may be inferior in terms of calculation speed. However, the proposed method has a significant advantage in terms of resource consumption, which becomes more pronounced as the number of NCUs decreases. Therefore, the overall resource utilization rate of the proposed method is higher, as the savings in resources compensate for the lag in calculation speed. Additionally, the power consumption of the proposed method is lower than that of the other methods.
All of the above comparisons are summarized in Table 10.
All in all, when compared with other methods, the proposed method exhibits similar or slightly inferior levels of maximum frequency or cycle consumption. However, it offers a significant advantage in terms of hardware resource consumption and resource utilization. Furthermore, this advantage can be further amplified with a decrease in the number of configured NCUs. Moreover, the flexibility of the proposed method makes it more convenient for broader applications.

7.2. Comparison of Calculation Speed in Different Platforms

Taking a neural network with topology 2-3-4-2 as an example, the learning time of one sample with the proposed method implemented on FPGA is compared with the learning time obtained on the Matlab platform and on a DSP chip. The result is shown in Figure 21, in which the abscissa is logarithmically scaled for better display.
The result in Figure 21 shows that the processing speed of the neural network implemented in the DSP chip is already much faster than that implemented in Matlab. However, the processing speed of the proposed method implemented on FPGA can be about a hundred times faster than that implemented in the DSP chip, which shows the superiority of FPGA implementation.

7.3. Verification

The proposed method is verified with two practical applications, both implemented on a Xilinx Artix-7 xc7a35t FPGA development board. The verification results are compared with those obtained in Matlab, the latter being more precise due to the use of the double-precision floating-point data format.

7.3.1. Iris Set Classification

The classification of the well-known Iris dataset is used to verify the proposed method [13]. The dataset is divided into training, validation, and generalization sets in proportions of 50%, 20%, and 30%. The neural network topology is 4-5-3, and the data bit-width of the proposed method is selected as [1,7,16]. Given the same initial weight values, the comparison of the training and validation error curves is shown in Figure 22a, and Figure 22b is a partial enlargement of Figure 22a.
In Figure 22, the training and validation error curves of the proposed method implemented on FPGA are almost the same as the error curves of the neural network trained in Matlab. After training, the test dataset is used to verify the training result. Only one of the 45 test samples is misclassified on both platforms, indicating that the identification accuracy reaches 97.7% and verifying the correctness of the proposed method.
Since increasing the number of NCUs reduces the training cycles of one epoch, the convergence can also be accelerated and the total training time shortened by increasing the number of NCUs. With the maximum absolute training error set to 0.1, the training error curves for different numbers of NCUs are shown in Figure 23.

7.3.2. Flux Training

The proposed method is expected to be further used in practical control applications, such as motor drives. Therefore, a flux training example is used to verify the proposed method. The flux, voltage, and current data of a motor drive system are used as the learning dataset, in which the voltage and current are the inputs and the flux is the output. The dataset has more than 1000 samples, which are divided into training, validation, and generalization sets in proportions of 50%, 20%, and 30%. The neural network topology is 8-5-5-2 [32] and the data bit-width of the proposed method is selected as [1,7,18]. There are eight inputs because the voltages and currents on the α and β axes of both the current cycle and the previous cycle are used as inputs. Given the same initial weight values, the training and validation error curves of the proposed method implemented on FPGA and the curves of the neural network trained in Matlab are compared in Figure 24a, and Figure 24b is a partial enlargement of Figure 24a.
In Figure 24, the training and validation error curves of the two platforms are about the same, with a minor difference in the small-error region. The difference is caused by the large number of training samples and the truncation error of the fixed-point data used in the FPGA. The training results are tested with the test dataset after training is finished. The curves of the ideal flux, the flux calculated by Matlab, and the flux calculated by the proposed method implemented on FPGA are shown in Figure 25.
In Figure 25, the calculated flux of the proposed method has little deviation compared to the ideal value, which verifies the effectiveness of the proposed method.

8. Conclusions

An on-chip learning method for MLP implemented on embedded FPGA is introduced in this paper. The method proposed in this study is highly parameterized, allowing for easy adaptation to different applications. It is based on the utilization of the neuron multiplexing technique, which effectively decouples resource consumption from the specific topology of the neural network. As a result, the proposed method demonstrates greater applicability across a wide range of scenarios. To support the proposed method and further reduce RAM block usage, a novel weight segmentation and recombination method is presented, along with a detailed introduction of the weight access order. Additionally, a performance model is developed to evaluate the performance index of the proposed method and facilitate the parameter selection process. Furthermore, a comprehensive comparison is conducted between the proposed method and the alternative approaches, encompassing resource usage, calculation cycles, maximum frequency, and other relevant factors. The results obtained from the comparison reveal that the proposed method significantly reduces hardware resource requirements while maintaining equivalent performance in terms of maximum frequency. While the calculation cycle may be slightly longer, the overall resource utilization is higher. Moreover, as the number of NCUs decreases in the configuration, the advantages of the proposed method in terms of resource consumption or utilization become even more evident. Furthermore, it is noteworthy that the processing speed of the proposed method implemented on FPGA is at least a hundred times faster than that achievable with DSP or Matlab. Verification results obtained from the application of Iris set classification and flux training demonstrate that the error curve for the FPGA implementation closely resembles that observed during Matlab training. Furthermore, the output accuracy of the proposed method meets the requirements for practical applications.

Author Contributions

Conceptualization, software, validation, investigation, and writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z., K.W., B.G. and G.C.; supervision, G.W. and K.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 61902422 and in part by the National Key Laboratory Foundation of China under Grant 202124E080.

Data Availability Statement

The data are available on request due to privacy restrictions. The data presented in this study can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Valencia, D.; Fard, S.F.; Alimohammad, A. An Artificial Neural Network Processor with a Custom Instruction Set Architecture for Embedded Applications. IEEE Trans. Circuits Syst. I: Regul. Pap. 2020, 67, 5200–5210.
2. Tolstikhin, I.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. MLP-Mixer: An All-MLP Architecture for Vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272.
3. Liu, H.; Dai, Z.; So, D.R.; Le, Q.V. Pay Attention to MLPs. Adv. Neural Inf. Process. Syst. 2021, 34, 9204–9215.
4. Zhao, Y.; Wang, G.; Tang, C.; Luo, C.; Zeng, W.; Zha, Z.-J. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP. arXiv 2021, arXiv:2108.13002.
5. Tran, H.N.; Le, K.M.; Jeon, J.W. Adaptive Current Controller Based on Neural Network and Double Phase Compensator for a Stepper Motor. IEEE Trans. Power Electron. 2019, 34, 8092–8103.
6. Wang, Z.; Hu, C.; Zhu, Y.; He, S.; Yang, K.; Zhang, M. Neural Network Learning Adaptive Robust Control of an Industrial Linear Motor-Driven Stage With Disturbance Rejection Ability. IEEE Trans. Ind. Inf. 2017, 13, 2172–2183.
7. Li, S.; Won, H.; Fu, X.; Fairbank, M.; Wunsch, D.C.; Alonso, E. Neural-Network Vector Controller for Permanent-Magnet Synchronous Motor Drives: Simulated and Hardware-Validated Results. IEEE Trans. Cybern. 2020, 50, 3218–3230.
8. Pasqualotto, D.; Zigliotto, M. Increasing Feasibility of Neural Network-Based Early Fault Detection in Induction Motor Drives. IEEE J. Emerg. Sel. Top. Power Electron. 2022, 10, 2042–2051.
9. Liu, Y.; Chen, Y.; Ye, W.; Gui, Y. FPGA-NHAP: A General FPGA-Based Neuromorphic Hardware Acceleration Platform With High Speed and Low Power. IEEE Trans. Circuits Syst. I: Regul. Pap. 2022, 69, 2553–2566.
10. Orlowska-Kowalska, T.; Kaminski, M. FPGA Implementation of the Multilayer Neural Network for the Speed Estimation of the Two-Mass Drive System. IEEE Trans. Ind. Inf. 2011, 7, 436–445.
11. Chine, W.; Mellit, A.; Lughi, V.; Malek, A.; Sulligoi, G.; Massi Pavan, A. A Novel Fault Diagnosis Technique for Photovoltaic Systems Based on Artificial Neural Networks. Renew. Energy 2016, 90, 501–512.
12. Li, Q.; Bai, H.; Breaz, E.; Roche, R.; Gao, F. ANN-Aided Data-Driven IGBT Switching Transient Modeling Approach for FPGA-Based Real-Time Simulation of Power Converters. IEEE Trans. Transp. Electrif. 2023, 9, 1166–1177.
13. Ortega-Zamorano, F.; Jerez, J.M.; Gómez, I.; Franco, L. Layer Multiplexing FPGA Implementation for Deep Back-Propagation Learning. Integr. Comput. Eng. 2017, 24, 171–185.
14. Zine, W.; Makni, Z.; Monmasson, E.; Idkhajine, L.; Condamin, B. Interests and Limits of Machine Learning-Based Neural Networks for Rotor Position Estimation in EV Traction Drives. IEEE Trans. Ind. Inf. 2018, 14, 1942–1951.
15. Wang, C.; Gong, L.; Yu, Q.; Li, X.; Xie, Y.; Zhou, X. DLAU: A Scalable Deep Learning Accelerator Unit on FPGA. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2016, 36, 513–517.
16. Kim, L.-W. DeepX: Deep Learning Accelerator for Restricted Boltzmann Machine Artificial Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 1441–1453.
17. Haghi, P.; Kamal, M.; Afzali-Kusha, A.; Pedram, M. O4-DNN: A Hybrid DSP-LUT-Based Processing Unit With Operation Packing and Out-of-Order Execution for Efficient Realization of Convolutional Neural Networks on FPGA Devices. IEEE Trans. Circuits Syst. I 2020, 67, 3056–3069.
18. Gong, L.; Wang, C.; Li, X.; Chen, H.; Zhou, X. MALOC: A Fully Pipelined FPGA Accelerator for Convolutional Neural Networks With All Layers Mapped on Chip. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2018, 37, 2601–2612.
19. Nakahara, H.; Que, Z.; Luk, W. High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression. In Proceedings of the 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Fayetteville, AR, USA, 3–6 May 2020; IEEE: New York, NY, USA, 2020; pp. 1–9.
20. Yang, C.; Feng, C.; Dong, W.; Jiang, D.; Shen, Z.; Liu, S.; An, Q. Alpha-Gamma Discrimination in BaF2 Using FPGA-Based Feedforward Neural Network. IEEE Trans. Nucl. Sci. 2017, 64, 1350–1356.
21. Faraone, J.; Kumm, M.; Hardieck, M.; Zipf, P.; Liu, X.; Boland, D.; Leong, P.H.W. AddNet: Deep Neural Networks Using FPGA-Optimized Multipliers. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 115–128.
22. Coutinho, M.G.F.; Torquato, M.F.; Fernandes, M.A.C. Deep Neural Network Hardware Implementation Based on Stacked Sparse Autoencoder. IEEE Access 2019, 7, 40674–40694.
23. Izeboudjen, N.; Farah, A.; Titri, S.; Boumeridja, H. Digital Implementation of Artificial Neural Networks: From VHDL Description to FPGA Implementation. In Engineering Applications of Bio-Inspired Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 1999; pp. 139–148.
24. Izeboudjen, N.; Farah, A.; Bessalah, H.; Bouridene, A.; Chikhi, N. Towards a Platform for FPGA Implementation of the MLP Based Back Propagation Algorithm. Lect. Notes Comput. Sci. 2007, 4507, 497–505.
25. Gomperts, A.; Ukil, A.; Zurfluh, F. Development and Implementation of Parameterized FPGA-Based General Purpose Neural Networks for Online Applications. IEEE Trans. Ind. Inform. 2011, 7, 78–89.
26. Ortega-Zamorano, F.; Jerez, J.M.; Munoz, D.U.; Luque-Baena, R.M.; Franco, L. Efficient Implementation of the Backpropagation Algorithm in FPGAs and Microcontrollers. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 1840–1850.
27. Tisan, A.; Chin, J. An End-User Platform for FPGA-Based Design and Rapid Prototyping of Feedforward Artificial Neural Networks With On-Chip Backpropagation Learning. IEEE Trans. Ind. Inform. 2016, 12, 1124–1133.
28. Savich, A.; Moussa, M.; Areibi, S. A Scalable Pipelined Architecture for Real-Time Computation of MLP-BP Neural Networks. Microprocess. Microsyst. 2012, 36, 138–150.
29. Gironés, R.G.; Palero, R.C.; Boluda, J.C.; Cortés, A.S. FPGA Implementation of a Pipelined On-Line Backpropagation. J. VLSI Sign. Process. Syst. Sign. Image Video Technol. 2005, 40, 189–213.
30. Senoo, T.; Jinguji, A.; Kuramochi, R.; Nakahara, H. Multilayer Perceptron Training Accelerator Using Systolic Array. IEICE Trans. Inf. Syst. 2022, E105.D, 2048–2056.
31. Ezilarasan, M.R.; Britto Pari, J.; Leung, M.-F. High Performance FPGA Implementation of Single MAC Adaptive Filter for Independent Component Analysis. J. Circuits Syst. Comput. 2023, 2350294.
32. Himavathi, S.; Anitha, D.; Muthuramalingam, A. Feedforward Neural Network Implementation in FPGA Using Layer Multiplexing for Effective Resource Utilization. IEEE Trans. Neural Netw. 2007, 18, 880–888.
33. Medus, L.D.; Iakymchuk, T.; Frances-Villora, J.V.; Bataller-Mompean, M.; Rosado-Munoz, A. A Novel Systolic Parallel Hardware Architecture for the FPGA Acceleration of Feedforward Neural Networks. IEEE Access 2019, 7, 76084–76103.
34. Dong, Y.; Li, C.; Lin, Z.; Watanabe, T. A Hybrid Layer-Multiplexing and Pipeline Architecture for Efficient FPGA-Based Multilayer Neural Network. Nonlinear Theory Its Appl. IEICE 2011, 2, 522–532.
35. Baptista, D.; Sousa, L.; Morgado-Dias, F. Raising the Abstraction Level of a Deep Learning Design on FPGAs. IEEE Access 2020, 8, 205148–205161.
36. Khalil, K.; Kumar, A.; Bayoumi, M. Reconfigurable Hardware Design Approach for Economic Neural Network. IEEE Trans. Circuits Syst. II 2022, 69, 5094–5098.
37. Xie, Y.; Joseph Raj, A.N.; Hu, Z.; Huang, S.; Fan, Z.; Joler, M. A Twofold Lookup Table Architecture for Efficient Approximation of Activation Functions. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 2540–2550.
38. Siddhartha, S.; Wilton, S.; Boland, D.; Flower, B.; Blackmore, P.; Leong, P. Simultaneous Inference and Training Using On-FPGA Weight Perturbation Techniques. In Proceedings of the 2018 International Conference on Field-Programmable Technology (FPT), Naha, Okinawa, Japan, 10–14 December 2018; IEEE: New York, NY, USA, 2018; pp. 306–309.
Figure 1. Structure of one layer calculation in MLP.
Figure 2. (a) Diagram of the traditional layer multiplexing method. (b) Diagram of the proposed neuron multiplexing method.
Figure 3. Calculation process of the layer multiplexing method.
Figure 4. Calculation process of the proposed method.
Figure 5. Top-level module and the main state machine.
Figure 6. Multipliers and weight storage.
Figure 7. Diagram of the forward calculation block.
Figure 8. Signal flow of MAC.
Figure 9. (a) The end time of each stage in the forward calculation phase. (b) The end time of each stage in the backward calculation phase. (c) The end time of the backward calculation phase.
Figure 10. Diagram of the backward calculation block.
Figure 11. Diagram of the update block.
Figure 12. Signal flow of multiplication in the update block.
Figure 13. (a) Access order of weight in the forward calculation phase. (b) Access order of weight in the backward calculation phase. (c) Storage order of weight in one layer.
Figure 14. Weight storage of topology 3-4-5-2.
Figure 15. (a) Weight access in the forward calculation phase. (b) Weight access in the backward calculation phase.
Figure 16. (a) Time sequence of input without delays. (b) Time sequence of input with delays.
Figure 17. Throughput/resource comparison.
Figure 18. (a) Resource consumption varies with NCU number. (b) Calculation cycles vary with NCU number.
Figure 19. Throughput/resources varies with NCU number.
Figure 20. (a) Resource consumption varies with data bit-width. (b) Maximum frequency varies with data bit-width.
Figure 21. Time comparisons in different platforms.
Figure 22. (a) Comparison of the error and validation curve of Iris set classification. (b) Partial enlargement.
Figure 23. (a) Training error curve of different NCUs. (b) Partial enlargement.
Figure 24. (a) Comparison of the error and validation curve of flux training. (b) Partial enlargement.
Figure 25. (a) Tested result of flux training. (b) Partial enlargement.
Table 1. Description of the State Machine.
State 0: Initial startup phase; weights are initialized in this phase.
State 1: Waiting phase; waiting for the input of new samples.
State 2: Forward calculation phase.
State 3: Backward calculation phase.
State 4: Updating phase.
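A minimal software sketch of the main state machine in Table 1 is given below. The handshake signal names (sample_ready, phase_done) are assumptions introduced for illustration and do not correspond to named signals in the hardware design.

```python
from enum import IntEnum

class State(IntEnum):
    INIT = 0      # initial startup phase: weights are initialized
    WAIT = 1      # waiting for a new training sample
    FORWARD = 2   # forward calculation phase
    BACKWARD = 3  # backward calculation phase
    UPDATE = 4    # weight-updating phase

def next_state(state: State, sample_ready: bool, phase_done: bool) -> State:
    """One transition of the five-state machine described in Table 1."""
    if state == State.INIT:
        return State.WAIT if phase_done else State.INIT
    if state == State.WAIT:
        return State.FORWARD if sample_ready else State.WAIT
    if state == State.FORWARD:
        return State.BACKWARD if phase_done else State.FORWARD
    if state == State.BACKWARD:
        return State.UPDATE if phase_done else State.BACKWARD
    # UPDATE: return to WAIT for the next sample once the update finishes
    return State.WAIT if phase_done else State.UPDATE
```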
Table 2. Comparison Results of the Proposed Method and the Method in [26].
Architecture | Type | LUT | Regis. | BR | DSP | MaxF (MHz) | Cycles
10-3-1Orte.64134151695/47
Prop.118696525218.859
10-6-3-2Orte.13,06267677612/56
Prop.2331199118185.195
10-50-1Orte.59,33522,76311652/94
Prop.24,31612,610252180.3234
30-30-10-2Orte.48,54719,00810743/127
Prop.11,94475131832180.4226
50-10-10-5Orte.25,53311,8539026/156
Prop.423233121012179.1209
60-15-10-5Orte.33,16313,8339531/181
Prop.567646971317177.9 244
Increased Mean (%): LUT −77.4, Regis. −65.1, BR −92.1, DSP −26.3, MaxF 1.0, Cycles 65.1
Table 3. Resource Consumption with Different NCU Number.
NCU | LUT | Regis. | BR | DSP | MaxF (MHz) | Cycles
5024,31612,610252180.269234
3515,90897362237180.435284
25999070801927185.252274
15610847531217180.435333
10409832621012177.077343
937662895911180.270383
52493180677178.591531
Table 4. Performance Index Varies with Different Data Bit-Width.
Bit-Width | LUT | Regis. | BR | DSP | MaxF (MHz)
(1,2,4)1464127708204.901
(1,2,8)1945166008204.564
(1,2,12)2331199118185.100
(1,2,16)3082235729168.560
(1,4,16)30772515215168.560
(1,8,16)35662611215164.643
Table 5. Resource Comparison of Proposed Method and Method in [13].
Type | LUT | Regis. | BR | DSP
Orte.1007428 n = Avail . RAM # neurons 1
Prop.10015300.611
Table 6. Cycle Comparison of Proposed Method and Method in [13].
Architecture | 10-3-1 | 10-6-3-2 | 10-50-1 | 30-30-10-2 | 50-10-10-5 | 60-15-10-5 | Increased Mean (%)
Orte. | 92 | 122 | 233 | 275 | 284 | 329 | −21.3
Prop. | 59 | 95 | 234 | 226 | 209 | 244 |
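Reading "Increased Mean (%)" as the average per-architecture relative change in cycle count (an interpretation consistent with the tabulated value), the figure can be reproduced with the short check below.

```python
# Cycle counts from Table 6 for the six benchmark topologies.
orte = [92, 122, 233, 275, 284, 329]   # method of Ref. [13]
prop = [59, 95, 234, 226, 209, 244]    # proposed method

# Average of the per-topology relative changes (assumed definition of "Increased Mean").
changes = [(p - o) / o * 100 for p, o in zip(prop, orte)]
print(round(sum(changes) / len(changes), 1))   # -> -21.3
```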
Table 7. Comparison Results of the Proposed Method and the Method in Ref. [30].
Platform: Xilinx Virtex UltraScale+ (XCVU35P); Architecture: 784-128-64-10.
Type | LUT | Regis. | BR | DSP | MaxF (MHz) | Power | Cycles
SENOO.391,166501,788784485020012494
Prop.(128)13,25345,98366130103.46.9672198
Table 8. Comparison Results of the Proposed Method and the Method in Ref. [38].
Platform: Xilinx Kintex-7 (XC7K410T).
Type | LUT | Regis. | BR | DSP | One Epoch's Time
Architecture 4-7-12-3:
Sidd.38,13556,291606100.6 μs
Prop.(12)536537471140.63 μs
Prop.(7)335725734.590.83 μs
Architecture 32-16-8-16-32:
Sidd.317,309368,341933361.3 μs
Prop.(32)18,24111,23317.5341.93 μs
Prop.(16)926267079.5182.4 μs
Table 9. Comparison Results of the Proposed Method and the Method in [27].
Platform: Xilinx Virtex-4 (XC4VSX35).
Architecture | Type | LUT | BR | DSP | MaxF (MHz) | Cycles
1-1-1Tisan.322423122.48956
Prop.85403145.56037
7-7-7Tisan.1167287196.51690
Prop.434709114.80067
7-2-4Tisan.6121243106.0980
Prop.271806122.07055
Table 10. The Summary of the Comparison Results.
Method | Resource Consumption | Speed (MaxF) | Speed (Cycles) | Resource Utilization
Orte. [26] | Greatly superior | Equivalent | Inferior | Greatly superior
Orte. [13] | Equivalent | Slightly inferior | Slightly superior | Equivalent
SENOO. [30] | Greatly superior | Inferior | Inferior | Superior
Sidd. [38] | Greatly superior | – | Slightly inferior | Superior
Tisan. [27] | Greatly superior | Slightly superior | Superior | Greatly superior
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
