Parallel Fixed Point Implementation of a Radial Basis Function Network in an FPGA

This paper proposes a parallel fixed-point radial basis function (RBF) artificial neural network (ANN), implemented in a field-programmable gate array (FPGA) and trained online with a least mean square (LMS) algorithm. The processing time and occupied area were analyzed for various fixed-point formats. The precision of the ANN response was also analyzed in hardware for nonlinear classification using the XOR gate and for interpolation using the sine function. The entire project was developed using the System Generator platform (Xilinx), with a Virtex-6 xc6vcx240t-1ff1156 as the target FPGA.


Introduction
Artificial neural networks (ANNs) are computational techniques that employ mathematical models inspired by the neural structures of intelligent organisms, which acquire knowledge based on past and present experience. These intelligent organisms possess extremely complex sets of cells, called neurons; the structure of an ANN is composed of processing units called artificial neurons, whose functioning involves parallel and distributed interconnections [1]. Among the popular ANN architectures are radial basis function (RBF) networks, which employ least mean square (LMS) algorithms in their training.
From the implementation perspective, one of the main problems of RBFs is the lack of a methodology for defining their topology in terms of the number of centers, which directly determines their computational cost. There are several heuristic techniques available to assist the designer with the network topology [1]. However, there is a consensus that the topology depends on the problem under investigation [2]. A commonly used procedure is to test various topologies for different sets of information. Nonetheless, this practice is time-consuming and can only be applied in cases of off-line training or when statistical information concerning the data space is known.
Various solutions for the implementation of application-specific integrated circuits (ASICs) have been proposed in order to accelerate the functioning and training of ANNs [3-5]. However, implementations in ASICs fix the architecture and the algorithm implemented, resulting in poor flexibility and/or high cost. Nevertheless, with advances in reconfigurable hardware structures, there has been renewed focus on the implementation of ANNs, and dedicated hardware structures are available that are flexible in terms of their topology and training algorithm. Currently, the most widely used architectures for reconfigurable hardware are FPGAs, which can provide performance similar to an ASIC, with the advantage of rapid prototyping. There have been several studies concerning the implementation of multilayer perceptrons (MLPs) in FPGAs [2,3,6-11], the implementation of RBFs in FPGAs [12-21], as well as the implementation of self-organizing maps (SOMs) in FPGAs [22].
This work presents a parallel hardware implementation of an RBF-type ANN. The implementation uses fixed-point arithmetic and is aimed at reconfigurable hardware architectures of the FPGA type. Different from earlier studies [2,3,6-8,12-20], a detailed analysis of the implementation of the RBF is provided here, considering aspects including processing time, delay time and the precision of the response. The proposed method was tested using two scenarios that have been widely reported in the literature. The first concerns the problem of nonlinear classification associated with the XOR gate, and the second considers the problem of interpolation using the sine function. All of the results were obtained using the System Generator (Xilinx) development platform [23], with a Virtex-6 xc6vcx240t-1ff1156 FPGA. The System Generator is a design tool built on top of MATLAB/Simulink [24].

Radial Basis Function Networks
RBF networks have become increasingly popular in the domain of artificial neural networks [1]. The structure of an RBF network consists of multiple layers; the processing is of the feedforward type, and the training can be either supervised (as used in the present work) or hybrid, where a supervised method is combined with an unsupervised method.

Architecture
The basic structure of an RBF network (Figure 1) consists of only three layers. The first layer connects the model with the environment and is composed of p inputs. The second (or hidden) layer is composed of H radial basis functions (also known as neurons) and performs a nonlinear transformation of the input vector space into an internal vector space, whose dimension is usually larger (H > p). The final (output) layer transforms the internal vector space into an output using a set of M linear neurons.
Radial basis functions only produce responses that are significantly different from zero when the input pattern is located within a small region of the domain, and each function requires a center and a scaling parameter, β. The most widely used radial basis function is the Gaussian function [1]. The h-th function, φ_h(·), can be expressed by:

φ_h(v_h(k)) = e^{−β v_h(k)},   (1)

where β = 1/(2σ²) and v_h(k) represents the distance of the input, x(k), in relation to the h-th center:

v_h(k) = ||x(k) − c_h||².   (2)

The vector of centers associated with the h-th radial function is characterized as:

c_h = [c_h1, c_h2, . . . , c_hp]^T,   (3)

and the vector of inputs as:

x(k) = [x_1(k), x_2(k), . . . , x_p(k)]^T.   (4)

Equation (2) can therefore be rewritten and expressed by:

v_h(k) = Σ_{i=1}^{p} (x_i(k) − c_hi)².   (5)

The output of the m-th neuron of the output layer, N_m, can be characterized as:

y_m(k) = Σ_{h=1}^{H} w_mh φ_h(v_h(k)).   (6)

Substituting Equation (1) in Equation (6) gives:

y_m(k) = Σ_{h=1}^{H} w_mh e^{−β v_h(k)},   (7)

where w_mh is the synaptic weight between neuron h of the hidden layer and neuron m of the output layer. At each instant k, the RBF network receives a vector of inputs, x(k), and generates an output vector, y(k), expressed by:

y(k) = [y_1(k), y_2(k), . . . , y_M(k)]^T.   (8)
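As a point of software reference, the forward pass described by Equations (1)-(8) can be sketched in a few lines of NumPy; the function and variable names (rbf_forward, centers, W) are illustrative, not taken from the hardware design:

```python
import numpy as np

def rbf_forward(x, centers, beta, W):
    """Forward pass of an RBF network (Equations (1)-(8)).

    x       : input vector, shape (p,)
    centers : matrix of centers c_h, shape (H, p)
    beta    : scaling parameter, beta = 1 / (2 * sigma**2)
    W       : output weights w_mh, shape (M, H)
    """
    # v_h(k) = sum_i (x_i(k) - c_hi)^2 -- squared distance to each center (Eq. (5))
    v = np.sum((centers - x) ** 2, axis=1)          # shape (H,)
    # s_h(k) = exp(-beta * v_h(k))     -- Gaussian radial functions (Eq. (1))
    s = np.exp(-beta * v)                           # shape (H,)
    # y_m(k) = sum_h w_mh * s_h(k)     -- linear output layer (Eq. (6))
    return W @ s                                    # shape (M,)

# Example: p = 2 inputs, H = 2 centers, M = 1 output (illustrative values)
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
W = np.array([[0.5, -0.5]])
y = rbf_forward(np.array([0.0, 0.0]), centers, beta=0.5, W=W)
```

Each row of the computation corresponds to one of the three layers of Figure 1, which is what the hardware modules described later implement in parallel.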

Training Algorithm
Due to the difference between the hidden layer and the output layer, the training of RBF networks is usually divided into two parts. The first part concerns the nonlinear optimization performed by the hidden layer: the input vector, x(k), is processed by means of the functions present in the hidden layer, characterized by Equation (1). Several different strategies can be used to determine the centers, c_h [1]. Different from conventional methods, where fixed centers are selected randomly [1], the strategy employed here was to select fixed centers deterministically. The second part of the training involves calculation of the weights, w_mh, between the hidden layer and the output layer. The weights can be obtained using the pseudo-inverse method [1] or the LMS algorithm [25]. The online LMS procedure used here is an iterative technique that optimizes the mean squared error (MSE) function using the gradient descent method with classical stochastic estimation (exchanging the mathematical expectation in favor of an instantaneous estimate) [25]. The parameters are adjusted at every instant, k, using the expression:

w_mh(k + 1) = w_mh(k) + µ e_m(k) φ_h(v_h(k)),   (9)

where µ is the learning rate and e_m(k) is the training error associated with the m-th output neuron at the k-th instant. The error can be expressed by:

e_m(k) = d_m(k) − y_m(k),   (10)

where d_m(k) is the desired value for the m-th output neuron. The online LMS is less computationally complex than the pseudo-inverse method, which requires a nonquadratic matrix inversion calculation in order to obtain the weights, w_mh(k) [25].

The proposed hardware is organized into three modules. (1) Input layer module (ILM): This module implements the H radial basis functions, φ_h(·), in parallel, generating the signals s_h(k) from the input vector, x(k). (2) Output layer module (OLM): This module implements the output layer, as described in Equation (6). In this module, the M neurons also function in parallel, generating the signal y(k) (see Equation (8)) from the input, x(k). (3) Updating algorithm module (UAM): This module is responsible for implementing the updating algorithm, which, in the present case, is the online LMS. The updating of each synaptic weight, w_mh(k), at the k-th instant, is performed in parallel, according to Equation (9).
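The two-part training described above (fixed, deterministically selected centers plus online LMS weight updates) can be sketched in floating-point software, using the XOR problem considered later in the paper; the values of β, µ and the centers here are illustrative choices, not the hardware parameters of the experiments:

```python
import numpy as np

# XOR training set: p = 2 inputs, M = 1 output
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 1., 1., 0.])

# Part 1: fixed centers chosen deterministically at the four input patterns (H = 4)
centers = X.copy()
beta = 2.0          # illustrative scaling parameter
mu = 0.3            # illustrative learning rate
w = np.zeros(4)     # synaptic weights w_1h, trained online

# Part 2: online LMS updates of the output weights
for epoch in range(200):
    for x, dm in zip(X, d):
        s = np.exp(-beta * np.sum((centers - x) ** 2, axis=1))  # hidden layer
        y = w @ s                                               # output neuron
        e = dm - y                # e_m(k) = d_m(k) - y_m(k)            (Eq. (10))
        w += mu * e * s           # w_mh(k+1) = w_mh(k) + mu e_m(k) s_h(k) (Eq. (9))

# After training, the network separates the XOR classes
outputs = [float(w @ np.exp(-beta * np.sum((centers - x) ** 2, axis=1))) for x in X]
```

Note that only the weights are adapted online; the centers stay fixed, which is what makes the hardware update rule of Equation (9) purely parallel.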
As shown in Figure 2, the weights, w mh , are stored in local registers of n bits, here termed RWmh.
The ILM, OLM and UAM modules function sequentially. For each instant k, the vector of inputs, x(k), is processed to generate the output vector, y(k), and all of the synaptic weights are calculated in order to update the registers, RWmh. It is important to note that it would be possible to further improve the performance of the RBF by using the modules in the form of a pipeline (this was not employed in the present work).

Radial Basis Functions
Figures 3 and 4 present details of the processing associated with the h-th radial basis function of the ILM. Figure 3 illustrates the implementation corresponding to Equation (5). In this case, the overall implementation is performed in a partially parallel manner, as described previously [2,7,26]. The delays associated with the additions and multiplications are also shown in Figure 3. Each addition can be implemented with a delay z^{−a} and each multiplication with a delay z^{−r}. The insertion of delays in the operations relaxes the routing conditions between the cells in the FPGA, principally in complex operations, such as those involving multipliers. On the other hand, the introduction of delays slows the response of the system, which can often be disadvantageous [23]. With the additions arranged as a binary tree (Figure 3), the total delay, D, with respect to Equation (5), can be expressed as:

D = a + r + a⌈log₂(p)⌉.   (11)

Figure 3. Implementation of Equations (2) and (5).
A variety of techniques are available for the calculation of nonlinear functions in hardware; in the case of FPGAs, one of the most common is the use of lookup tables (LUTs) in ROM memory [2,7,26]. Figure 4 shows the proposed implementation for the h-th radial function (see Equation (1)), in which read-only memory (ROM) was used to implement the LUT. For each h-th radial function, there is an LUTh and a multiplication operation to adjust the scaling parameter, β (stored in the BETAh register). In order to obtain greater resolution for each h-th LUTh, the v_h(k) variable can be rescaled to a new format, expressed by Un.b_LUTh, in which b_LUTh is a new value for the number of bits of the fractional part, calculated using Equations (12) and (13). Figure 4 illustrates the re-scaling step performed after the multiplication operation and before the LUT step.
As v_h(k) · β > 0 for any k, the value of the response of the radial function is limited to 0 ≤ s_h(k) ≤ 1 and can therefore be represented at a fixed point as Ub_LUTh.b_LUTh, in which the number of bits of the integer part is zero. The values stored in the h-th LUT can be characterized by the vector LUT_h, expressed as:

LUT_h = [φ(0), φ(Δ), . . . , φ((P − 1)Δ)]^T,   (14)

where Δ = 2^{−b_LUTh} is the quantization step of the LUT argument and P is the depth of the LUT, characterized as:

P = 2^{b_LUTh}.   (15)

The delay associated with Equation (1), D_1, can be expressed as the sum of the delay corresponding to the calculation of the distance (see Equation (11)), the delay of the multiplication by the scaling parameter and the delay corresponding to the processing of the LUT:

D_1 = D + r + q,   (18)

where q is the LUT delay (z^{−q}).
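A software sketch of the LUT-based evaluation of the radial function may help clarify the quantization involved; for simplicity, this sketch assumes the product v_h(k)·β has already been rescaled into [0, 1), whereas the hardware design performs the re-scaling to the Un.b_LUTh format described above:

```python
import numpy as np

def build_gaussian_lut(b_lut):
    """Tabulate s = exp(-u) for u in [0, 1) with b_lut fractional bits.

    The LUT depth is P = 2**b_lut entries, one per quantized value of the
    argument v_h(k) * beta.
    """
    P = 2 ** b_lut
    u = np.arange(P) / P                 # quantized arguments: 0, 2^-b, 2*2^-b, ...
    return np.exp(-u)

def lut_gaussian(v, beta, b_lut):
    """Approximate exp(-beta * v) by a table lookup, as in Figure 4."""
    P = 2 ** b_lut
    addr = min(int(v * beta * P), P - 1)  # truncate to b_lut fractional bits, saturate
    return build_gaussian_lut(b_lut)[addr]

# With 10 fractional bits, the lookup error is bounded by one quantization step
exact = np.exp(-0.5 * 0.8)
approx = lut_gaussian(0.8, beta=0.5, b_lut=10)
```

Since |d/du e^{−u}| ≤ 1 on this domain, the absolute lookup error is bounded by the quantization step 2^{−b_lut}, which is why increasing b_LUTh improves the resolution of the radial function.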

Output Layer Neurons
The implementation of each m-th neuron, N_m, of the output layer is illustrated in Figure 5. Similar to the processing presented in Figure 3, the sums of the products of the inputs and the synaptic weights were also implemented in a partially parallel manner [2,7,26].

Figure 5. Implementation of the m-th neuron of the output layer (see Equation (6)).
In the case of the implementations associated with the output neurons, the addition operations can have delays of z^{−u} samples, and the multiplication operations can have delays of z^{−c} samples. The total accumulated delay, from the input to the output of the RBF, can be expressed by:

D_2 = D_1 + c + u⌈log₂(H)⌉.   (19)
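Assuming that the partially parallel sums are implemented as binary adder trees (as suggested by Figures 3 and 5), the accumulation of the pipeline delays can be modeled with a short script; the helper names and the unit delay values below are illustrative:

```python
from math import ceil, log2

def distance_delay(p, a, r):
    """Delay D of the distance calculation (Eq. (5)): one subtraction (a),
    one squaring multiplication (r) and ceil(log2(p)) addition stages of
    delay a each, assuming a binary adder tree."""
    return a + r + (a * ceil(log2(p)) if p > 1 else 0)

def hidden_delay(D, r, q):
    """Delay D1 of the radial function: distance calculation, multiplication
    by beta (r) and LUT read (q)."""
    return D + r + q

def total_delay(D1, H, u, c):
    """Accumulated delay D2 from input to RBF output: one weight
    multiplication (c) plus ceil(log2(H)) addition stages of delay u."""
    return D1 + c + (u * ceil(log2(H)) if H > 1 else 0)

# Example with unit delays a = r = u = c = q = 1 (illustrative, not Table 3)
D = distance_delay(p=2, a=1, r=1)
D2 = total_delay(hidden_delay(D, r=1, q=1), H=4, u=1, c=1)
```

The model makes explicit that the delay grows only logarithmically with p and H, which is why the parallel structure keeps the latency low even for larger networks.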

Online LMS Algorithm
The implementation of the online LMS associated with each synaptic weight, according to Equation (9), is presented in the diagram shown in Figure 6. All M × H weights are updated in parallel and stored in the RWmh registers. The RMU register stores the value of the learning rate, µ, and, to avoid misalignment in the algorithm, the operations of addition, subtraction and multiplication are implemented with zero delay.

Delays of Operations
Due to the reduction in the size of transistors and increases in clock frequency, delays caused by interconnection paths (also known as routing delays) are one of the dominant factors affecting timing restrictions. In FPGAs, routing delays are caused by programmable routing switches, which significantly increase the wire delay [27]. Techniques such as wire pipelining (or delay padding), in which flip-flops or latches are inserted between critical stages, can shorten the critical paths in order to achieve the necessary timing restrictions [27,28].

Figure 6. Updating of the synaptic weights, w_mh(k), according to Equation (9).
Hence, based on the technique of wire pipelining, it can be seen from Equations (11), (18) and (19) that there are delays associated with the operations of addition, subtraction, multiplication and ROM reading in the structures presented in Figures 3-5. The delays are implemented by the addition of flip-flops or latches after the operations, represented here by the variables a, r, u, c and q. It is important to highlight that the operations (addition and multiplication) in the updating circuit of the synaptic weights, w mh (k) (see Figure 6), do not use delays (wire pipelining technique). This restriction ensures synchronization of updating of the weights with their corresponding inputs.
It can be seen from the case studies using the Virtex-6 presented in [29-33] that the value of the delay influences the maximum clock frequency associated with the arithmetic operation in question. For example, in [29], it was observed that the addition operation in the Virtex-6 for signed 32-bit variables could achieve a maximum clock frequency of 410 MHz using a delay of z^{−3}, while a maximum clock frequency of 388 MHz was obtained in the absence of any delay.

Analysis of the Area Occupied
The occupied area (in the FPGA) of the RBF network can be expressed as:

A_RBF(n, a, r, c, u, p, H, M, α, γ, η, κ, P, D_1, D_2) = A_ILM(n, a, r, p, H, P, γ, α) + A_OLM + A_UAM,   (20)

where A_ILM, A_OLM and A_UAM are collections formed by the number of flip-flops, N_FF, the number of LUTs, N_LUT, and the number of embedded multipliers, N_Mult, used by the ILM, OLM and UAM modules, respectively. N_BRAM is the number of embedded block RAMs, which the ILM can use to implement the H ROMs that represent the radial functions [34]. It is important to observe that the ROMs can also be implemented with logic cells.
The ILM module can be characterized by the following components. N_FF^Sub(n, a) and N_FF^Add(n, a) are the numbers of flip-flops required to implement a subtraction and an addition of n bits with a delay of a, respectively, while N_LUT^Sub(n, a) and N_LUT^Add(n, a) are the corresponding numbers of LUTs. N_LUT^Mult(n, r) is the number of LUTs required to implement a multiplication of n bits with a delay of r. N_LUT^RC(n) and N_LUT^BETA(n) represent the number of LUTs required to implement each RChp and BETAh register, respectively. The variable N_LUT^ROM(n, q, P) is the number of LUTs used to implement, with logic cells, the h-th ROM with depth P and n bits (see Equation (15)); N_LUT^ROM-BRAM(n, q, P) represents the situation where the ROM is implemented with block RAMs. Finally, N_LUT^Rout(n) is the number of LUTs used for routing [34]. The number of embedded multipliers used in the ILM depends on the portion, γ, of multipliers implemented with logic cells, and the number of block RAMs is:

N_BRAM^ILM(n, H, α, P) = (H − α) N_BRAM^ROM(n, P),   (27)

where N_BRAM^ROM(n, P) is the number of block RAMs needed to implement each h-th ROM with depth P and n bits (see Equation (15)).
For the OLM module, η represents the portion of the HM multipliers (in the OLM) that are implemented with logic cells. For the UAM module, κ represents the portion of the 2HM multipliers that are implemented with logic cells, and N_FF^RW(n) and N_LUT^RW(n) represent the number of flip-flops and LUTs required to implement each RWmh register, respectively. The number of embedded multipliers in the UAM follows directly from κ.
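Since the per-operation resource counts (N_FF, N_LUT, etc.) are vendor-reported figures, a software model of the area estimate reduces to bookkeeping: summing the per-module collections as in Equation (20) and applying Equation (27) for the block RAMs. A minimal sketch, with purely illustrative (not vendor) numbers, follows:

```python
from dataclasses import dataclass

@dataclass
class Area:
    """Collection of FPGA resources, as in Equation (20): flip-flops,
    LUTs, embedded multipliers and block RAMs."""
    ff: int = 0
    lut: int = 0
    mult: int = 0
    bram: int = 0

    def __add__(self, other):
        return Area(self.ff + other.ff, self.lut + other.lut,
                    self.mult + other.mult, self.bram + other.bram)

def ilm_bram(H, alpha, n_rom_bram):
    """Equation (27): block RAMs for the (H - alpha) ROMs not mapped to
    logic cells; n_rom_bram is the BRAM cost of a single ROM."""
    return (H - alpha) * n_rom_bram

# Illustrative per-module totals (NOT vendor figures), summed as in Eq. (20)
a_ilm = Area(ff=400, lut=1200, mult=8, bram=ilm_bram(H=4, alpha=4, n_rom_bram=1))
a_olm = Area(ff=120, lut=300, mult=4)
a_uam = Area(ff=200, lut=250, mult=8)
a_rbf = a_ilm + a_olm + a_uam
```

With alpha = H, as in the experiments reported below, all ROMs are mapped to logic cells and the BRAM term of Equation (27) vanishes.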

Results and Experimental Tests
Two operational scenarios were analyzed in order to validate the implementation of the RBF in an FPGA. The first scenario was a widely known problem of nonlinear classification, in which the RBF attempted to copy the functioning of the XOR gate. In the second scenario, the RBF performed interpolation of the sine function. Tables 1-3 present the parameters used in the experimental tests involving the two scenarios. The results were obtained using the System Generator development platform (Xilinx) [23] and a Virtex-6 xc6vcx240t-1ff1156 FPGA. The Virtex-6 FPGA possesses 37,680 slices grouping 301,440 flip-flops, 150,720 LUTs that can be used to implement logic functions or memories and 768 DSP cells (Virtex-6 FPGA DSP48E1) with multipliers and accumulators [23,29,33]. In both scenarios, the signal sampling rate was R_s = 1/T_s, where T_s is the time between the k-th samples.

Table 1. Parameters used for the XOR gate scenario.
Number of neurons of the output layer (M): 1
Scaling parameter associated with the radial function (β): 2
Learning rate (µ): 0.3125

Table 2. Parameters used for the sine function interpolation scenario.
Number of inputs (p): 1
Number of centers (H): 4
Number of neurons of the output layer (M): 1
Scaling parameter associated with the radial function (β): 0.125
Learning rate (µ): 0.5

Table 3. Delays associated with the arithmetic operations used in the two scenarios.
In the results described in this section, the operations of addition and subtraction were implemented with logic cells utilizing LogiCORE IP Adder/Subtracter v11.0 (Xilinx) [30]. All of the multiplication operations were implemented using embedded multipliers (γ = η = κ = 0) of the type Virtex-6 FPGA DSP48E1 Slice with LogiCORE IP Multiplier v11.2 [29,33], and all of the ROMs associated with the radial functions were implemented using logic cells (α = H) with LogiCORE IP Distributed Memory Generator v6.3 [32].
In the Virtex-6 FPGA, the arithmetic operations (addition, subtraction and multiplication) do not possess restrictions related to the delays (a ≥ 0, r ≥ 0, u ≥ 0 and c ≥ 0). Meanwhile, as described in Subsection 3.5, the technique of wire pipelining can assist in the routing process, reducing the distance between the operations in order to achieve the timing restrictions. In the case of the ROM, implementation in the Virtex-6 FPGA with internal memory blocks (Virtex-6 block RAMs) requires q > 0; when the ROM is implemented with logic cells, q ≥ 0 is sufficient [31,32]. Table 3 presents all of the delay values used in the simulations; the values were selected based on the test cases presented in [29-31], where various delay values and the corresponding maximum frequencies are reported. For example, in [29], it was found that the multiplication operation for signed 18-bit variables could achieve a maximum clock frequency of 450 MHz using a delay of r = 3. Tables 4 and 5 present the results obtained after the process of synthesis of the RBF (with the parameters presented in Tables 1-3) in the FPGA. For both scenarios, it can be clearly seen that the sampling rate and the occupied area were highly sensitive to the number of bits. On the other hand, due to the parallelization, differences between the two scenarios were not large for the same quantity of bits (14, 15 and 16). The area occupied by the registers is due to the storage of the fixed centers (RChi), constants (BETAh, RMU), synaptic weights (RWmh) and delays. This is affected to a greater extent by the architecture of the ANN (number of inputs, p, number of centers, H, and number of outputs, M) than by the number of bits.
The occupation of logic cells is related to the addition operations performed using logic functions, as well as to the radial functions developed by means of ROM memories, as described in Section 3.2. In the latter case, the degree of precision and the architecture of the network exert a direct influence, as shown in Tables 4 and 5. The multiplication operations were synthesized in internal DSP circuits, as a result of which the area consumed by these operations remained constant with respect to the number of bits and only altered with structural changes in the network. For the XOR gate scenario (Figure 7), the MSE was calculated using frames of 16 samples, and 40 frames were tested. For the scenario involving interpolation of the sine function (Figure 8), frames of 4096 samples were utilized to calculate the MSE, and 128 frames were tested. In both cases, the implemented RBF showed satisfactory convergence, with the best results being directly related to the number of bits, n.
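The sensitivity of the precision to the number of bits can be illustrated, in software, by quantizing the hidden-layer outputs and weights of a floating-point RBF to different numbers of fractional bits; the interpolation setup below (centers, β, least-squares weights, rounding quantizer) is a simplified stand-in for the fixed-point hardware, not a reproduction of it:

```python
import numpy as np

def quantize(x, frac_bits):
    """Round x to a fixed-point grid with frac_bits fractional bits."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

def sine_mse(frac_bits, beta=0.125, H=4):
    """MSE of a quantized RBF interpolating sin() over one period, with H
    fixed centers and least-squares weights (illustrative setup)."""
    centers = np.linspace(0.0, 2 * np.pi, H)
    k = np.linspace(0.0, 2 * np.pi, 256)
    Phi = np.exp(-beta * (k[:, None] - centers[None, :]) ** 2)  # hidden layer
    w = np.linalg.lstsq(Phi, np.sin(k), rcond=None)[0]          # output weights
    # Quantize the hidden-layer outputs and the weights to frac_bits bits
    y = quantize(Phi, frac_bits) @ quantize(w, frac_bits)
    return float(np.mean((np.sin(k) - y) ** 2))

# More fractional bits -> lower quantization error, mirroring Tables 4 and 5
errs = [sine_mse(b) for b in (4, 8, 12)]
```

The monotone trend of errs with the number of fractional bits mirrors the behavior reported in the hardware results, where the best MSE values were directly related to n.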

Estimate of the Area Occupied
Using as a reference the parameters presented in Tables 1-3, Equations (20)-(27) and the estimates provided by the Xilinx manufacturer (see Tables 6 and 7) [29-33], it was possible to obtain estimated area occupation limits for the flip-flops and LUTs with n = 8 and n = 32 bits, as shown in Figures 9-12. These figures also show the real results obtained after synthesis in the case of the XOR gate (n = {12, . . . , 16}) and the sine function (n = {14, . . . , 18}). Since embedded (fixed) multipliers were used, the multiplications did not influence the area occupied. However, it is important to point out that if the embedded multipliers are not sufficient, multipliers could be constructed with logic cells. It is also important to observe that increasing the number of bits results in an exponential growth in the occupied area.

Table 6. Estimated number of flip-flops provided by the Xilinx manufacturer [29-33].

Table 7. Estimated number of LUTs provided by the Xilinx manufacturer [29-33].

Figures 13 and 14 show occupation estimates for various numbers of inputs, p, and centers, H, for M = 1 and n = 8. The curves demonstrate that the occupation capacity is more influenced by the number of bits, n (exponential increase), than by the structure of the RBF, which grows linearly with the parameters p, H and M. The curves illustrated in Figures 13 and 14 show that, for example, with eight bits, it is possible to work with a relatively large RBF (p = 64, H = 64 and M = 1) using around 50% of the occupancy of a Virtex-6 (40,000 LUTs and flip-flops, plus multipliers). This RBF configuration could be used in adaptive equalizers [35,36] and in some applications in mobile robotics [37].

Real-World Cases
Artificial neural networks of the RBF type can be applied to various real-world problems that may or may not require online training. Notable examples include adaptive equalizers for wireless communication systems [35,36], obstacle avoidance in mobile robotics [37] and pattern recognition [38], amongst other applications. Nevertheless, for problems that require online training, the speed of the training, in terms of the number of iterations per second, is a fundamental determinant of viability in dedicated hardware. The present work was developed from this perspective.
Analysis of the results obtained for time (or, in other words, the sampling rate, R, of the ANN) showed that in both cases (XOR gate and sine function), the value was much more closely associated with the number of bits, n, than with the structure of the network, which was due to the parallel implementation proposed in this paper. From the results obtained, it could be seen that if space exists in the FPGA for the implementation of the ANN, the sampling rate essentially depends on the number of bits. For example, in the case of the FPGA used here, it was possible to achieve sampling rates of around 100 MHz (see Tables 4 and 5); in other words, to work at 100 mega-iterations per second.

Conclusions
This work presents a parallel fixed point implementation, in an FPGA, of an RBF trained with an online LMS algorithm. The implementation of the RBF was analyzed in terms of occupation area, bit resolution and processing delay. The proposed structure was tested at different resolutions, using two widely known scenarios, namely the problem of nonlinear classification with the XOR gate and the problem of interpolation utilizing the sine function. The results obtained were highly satisfactory, indicating the potential feasibility of the technique for use in practical situations of greater complexity.

Author Contributions
Marcelo A. C. Fernandes conceived and designed the experiments; Alisson C. D. de Souza performed the experiments; Marcelo A. C. Fernandes and Alisson C. D. de Souza analyzed the data; Marcelo A. C. Fernandes wrote the paper.

Conflicts of Interest
The authors declare no conflict of interest.