Efficient VLSI Architecture for Training Radial Basis Function Networks

This paper presents a novel VLSI architecture for the training of radial basis function (RBF) networks. The architecture contains the circuits for fuzzy C-means (FCM) and the recursive Least Mean Square (LMS) operations. The FCM circuit is designed for the training of centers in the hidden layer of the RBF network. The recursive LMS circuit is adopted for the training of connecting weights in the output layer. The architecture is implemented by the field programmable gate array (FPGA). It is used as a hardware accelerator in a system on programmable chip (SOPC) for real-time training and classification. Experimental results reveal that the proposed RBF architecture is an effective alternative for applications where fast and efficient RBF training is desired.


Introduction
Radial basis function (RBF) [1,2] networks have been found to be effective for many real world applications due to their ability to approximate complex nonlinear mappings with a simple topological structure. A basic RBF network consists of three layers: An input layer, a hidden layer with a nonlinear kernel, and a linear output layer. The Gaussian function is commonly used for the nonlinear kernel. structure of the networks. The training of the centers in the hidden layer and the connecting weights in output layer are performed by software. Other RBF-based applications in embedded systems [15,16] are also implemented in a similar fashion.
In [17,18], the digital hardware architectures for RBF training have been presented. However, the training for centers is not considered in [17]. The training for connecting weights is based on incremental operations. The architecture in [18] is able to train both the centers and the connecting weights. All training operations are performed incrementally. Although the incremental training is more suitable for hardware implementation, the performance is dependent on the selection of learning rate. The value of learning rate may be truncated for the finite precision hardware implementation. Similar to the improper learning rate selection, the truncation of learning rate may result in a poor local optimum for RBF training.
The goal of this paper is to present a novel hardware architecture for real-time RBF training. The architecture is separated into two portions: the FCM circuit, and the recursive LMS circuit. The FCM circuit is designed for the training of centers in the hidden layer. The recursive LMS circuit is adopted for the training of connecting weights in the output layer. Both the FCM and the recursive LMS circuits are digital circuit requiring no learning rate.
The FCM circuit features low memory consumption and high speed computation. In the circuit, the usual iterative operations for updating the membership matrix and cluster centers are merged into one single updating process to evade the large storage requirement. In addition, the single updating process is implemented by a novel pipeline architecture for enhancing the throughput of the FCM training. In our design, the updating process is divided into three steps: Pre-computation, membership coefficients updating, and center updating. The pre-computing step is used to compute and store information common to the updating of different membership coefficients. This step is beneficial for reducing the computational complexity for the updating of membership coefficients. The membership updating step computes new membership coefficients based on a fixed set of centers and the results of the pre-computation step. The center updating step computes the center of clusters using the current results obtained from the membership updating step. The final results of this step will be used for subsequent RBF processing.
The recursive LMS circuit performs weight updating using the centers obtained from the FCM circuit. The recursive LMS algorithm involves large number of matrix operations. To enhance the computational speed of matrix operations, an efficient block computation circuit is proposed for parallel multiplications and additions. The block dimension is identical to the number of nodes in the hidden layer so that all the connecting weights can be updated concurrently. To facilitate the block computation, buffers for storing intermediate results of recursive LMS algorithm are implemented as shift registers allowing both horizontal and vertical shifts. Columns and rows of a matrix can then easily be accessed. All matrix operations share the same block computation circuit for lowering area cost. Therefore, the proposed block computation circuit has the advantages of both high speed computation and low area cost for recursive LMS.
To demonstrate the effectiveness of the proposed architecture, a hardware classification system on a system-on-programmable-chip (SOPC) platform is constructed. The SOPC system may be used as a portable sensor for real-time training and classification. The system consists of the proposed architecture, a softcore NIOS II processor [19], a DMA controller, and a SDRAM. The proposed architecture is adopted for online RBF training with the training vectors stored in the SDRAM. The DMA controller is used for the DMA delivery of the training vectors. The softcore processor is used for coordinating the SOPC system. Some parameters of the RBF training process are not fixed by hardware. They can be modified by the softcore processor to enhance the flexibility of the SOPC system. As compared with its software counterpart running on Intel I5 CPU, our system has significantly lower computational time for large training set. All these facts demonstrate the effectiveness of the proposed architecture.

The RBF Networks
This section reviews some basic facts of RBF networks. A typical RBF network revealed in Figure 1 consists of an input layer, a hidden layer and an output layer. The input layer contains n source nodes, where n is the dimension of the input vector x. The hidden layer consists of c neurons. A kernel function is associated with each neuron. A typical kernel function used in the RBF networks is the Gaussian kernel. Let φ i be the Gaussian kernel associated with the i-th neuron, which is defined as The v i in Equation (1) is the center associated with the i-th neuron. Both x and v i have the same dimension n. The σ 2 in Equation (1) is termed the radius of the Gaussian kernel. It is assumed in this study that all kernels have the same radius. The output layer contains only one neuron. Letŷ be the output of the neuron, which is given bŷ The w i is termed the connecting weights between the i-th neuron in the hidden layer and the output neuron. The RBF training usually involves the training of centers v i , and connecting weights w i , i = 1, ..., c.

Figure 1.
A typical RBF network.

FCM for the Training of Centers
The FCM can be effectively used for the training of centers. Let X = {x 1 , ..., x t } be a set of training vectors for RBF training, where t is the number of training vectors. The FCM computes v i , i = 1, ..., c, by separating X into c clusters. The v i is then the center of cluster i. The FCM involves minimization of the following cost function: where u i,k is the membership of x k in class i, and m > 1 indicates the degree of fuzziness. The cost function J is minimized by a two-step iteration in the FCM. In the first step, the centers v 1 , ..., v c , are fixed, and the optimal membership matrix {u i,k , i = 1, ..., c, k = 1, ..., t} is computed by After the first step, the membership matrix is then fixed, and the new center v i is obtained by The iteration continues until the convergence of J. From Equations (3) and (5), it follows that the membership matrix needs to be stored for the computation of cost function and centers. As the size of the membership matrix grows with the product of t and c, the storage size required for the FCM may be impractically large for hardware implementation.

Recursive LMS for the Training of Connecting Weights
The training of connecting weights is also based on the training set X = {x 1 , ..., x t }. Letŷ k be the output of RBF network when the input is the k-th training vector x k ∈ X. That is, from Equation (2) It then follows thatŷ In addition, letŷ be the vector containing all the outputs for the training set X, and From Equations (8) and (10), we see thatŷ = Aw (11) as the vector consisting of all the desired outputs for the training set X, where y k is the desired output associated with the input x k . Let be the square distance between y andŷ. It can be shown that [2] the LMS estimate of w minimizing E is given by Finding w based on Equation (14) involves the operations of matrix inverse and multiplication. The LMS estimate of w may therefore be difficult to be implemented by hardware when number of training vectors t and/or the number of centers c are large. An effective alternative to the LMS method is the recursive LMS. Given training set X, instead of computing w in one shot using Equation (14), the recursive LMS computes w incrementally. Suppose training vectors become available in sequential order. Without loss of generality, assume x 1 , ..., x k−1 and the corresponding outputs y 1 , ..., y k−1 are available. Define Based on x 1 , ..., x k−1 , the first (k − 1) rows of A can be evaluated. Let be the first (k − 1) rows of A. The LMS estimate of w based on A k−1 and y k−1 , denoted by w k−1 , can be computed by Equation (14) as Suppose a new data pair (x k , y k ) becomes available. Then instead of using all the k available data pairs to recompute the w k , the recursive LMS takes the advantage of the w k−1 already available to obtain w k . Define It can then be shown that and To initialize the algorithm, set where λ is a small positive number.

The Architecture
As shown in Figure 2, the proposed architecture for RBF training can be separated into two units: the FCM unit and the recursive LMS unit. The goal of the FCM unit is to compute the centers v i , i = 1, ..., c, given the training set X. Based on the centers produced by FCM unit, and the training set X, the recursive LMS unit finds the weights w i , i = 1, ..., c.   Figure 3 shows the architecture of the FCM unit, which contains six sub-units: The pre-computation unit, the membership coefficients updating unit, center updating unit, cost function computation unit, FCM memory unit, and control unit. The operations of each sub-unit are stated below.

Pre-Computation Unit
The pre-computation unit is used for reducing the computational complexity of the membership coefficients calculation. Observe that u i,k in Equation (4) can be rewritten as Given x k and centers v 1 , ..., v c , membership coefficients u 1,k , ..., u c,k have the same R k . Therefore, the complexity for computing membership coefficients can be reduced by calculating R k in the pre-computation unit. For the sake of simplicity, we set m = 2 for our design. Consequently, R k can be viewed as the sum of The architecture for computing R k is depicted in Figure 4, which can be divided into two stages. The first stage evaluates ||x k − v i || 2 . The second stage first finds the inverse of ||x k − v i || 2 , and then accumulate this value with i−1

The Membership Updating Unit
Based on Equation (22), the membership updating unit uses the computation results of the pre-computation unit for calculating the membership coefficients. Figure 5 shows the architecture of the membership coefficients updating unit. It can be observed from Figure 5 that, given a training data x k , the membership coefficients computation unit computes u 2 i,k for i = 1, ..., c, one at a time. The circuit can be separated to two stages. The first stage and the second stage of the pipeline are used for computing ||x k − v i || 2 R k and u 2 i,k , respectively.

Center Updating Unit
The center updating unit incrementally computes the center of each cluster. The major advantage for the incremental computation is that it is not necessary to store the entire membership coefficients matrix for the center computation. Define the incremental center for the i-th cluster up to data point when k = t, v i (k) then is identical to the actual center v i given in Equation (5). The architecture of the center updating unit is depicted in Figure 6. It contains a multiplier, an accumulator (ACC) array and a divider. There are two groups in the ACC array. The i-th ACC in the first group contains the accumulated sum k−1 j=1 x j µ 2 i,j . Moreover, the i-th ACC in the second group contains the accumulated sum k−1 j=1 µ 2 i,j . The outputs of the array are used for computing v i (k) using a divider.

Cost Function Computation Unit
Similar to the center updating unit, the cost function unit incrementally computes the cost function J. Define the incremental cost function J(i, k) as As shown in Figure 7, the circuit receives u 2 i,k and ||x k −v i || 2 from the membership coefficients updating unit. The product u 2 i,k ||x k − v i || 2 is then accumulated for computing J(i, k) in Equation (25). When i = c and k = t, J(i, k) then is identical to the actual cost function J given in Equation (3). Therefore, the output of the circuit becomes J as the cost function computations for all the training vectors are completed.

FCM Memory Unit
This unit is used for storing the centers for FCM clustering. There are two memory banks (Memory Bank 1 and Memory Bank 2) in the on-chip center memory unit. The Memory Bank 1 stores the current centers v 1 , ..., v c . The Memory Bank 2 contains the new centers v 1 , ..., v c obtained from the center updating unit. Only the centers stored in the Memory Bank 1 are delivered to the pre-computation unit and membership updating unit for the membership coefficients computation. The updated centers obtained from the center updating unit are stored in the Memory Bank 2. Note that the centers in the Memory Bank 2 will not replace the centers in the Memory Bank 1 until all the input training data points x k , k = 1, ..., t, are processed.

Employment of Shift Registers for Reducing Area Costs for Large Input Vector Dimension n
In the pre-computation unit, membership coefficient updating unit and center updating unit of the FCM, a number of vector operations are required. Each of these operations needs n adders, multipliers or dividers to operate in parallel. Therefore, as the input vector dimension n becomes large, the area costs will be high.
One way to reduce the area costs is to separate each of the input vectors x k and centers v i into q segments, where each segment contains only n/q elements. The vector operations are then performed over the segments. This requires only n/q adders, multipliers or dividers to operate in parallel. To implement the segment-based operations, each of the registers holding the input vectors x k and centers v i has to be implemented as a q-stage shift register. Each stage of the register consists of n/q elements (i.e., one segment). That is, the shift registers are able to fetch or deliver one segment at a time. The shift registers are then connected to an array of n/q adders, multipliers or dividers for vector operations with reduced area costs. The vector operations will not be completed until all the segments in the shift registers are processed. Therefore, the latency of the vector operations may increase by q-fold.
The shift-register based approach has a number of advantages. First of all, it does not change the basic architectures of the proposed FCM circuit. In fact, the FCM circuits with different q values share the same architectures for pre-computation, membership coefficient updating, center updating, and cost function computation. Only the registers holding input training vectors and centers may have different architectures. For the basic FCM circuit with q = 1, these registers are the simple n-elements parallel-in parallel-out registers. When q ≥ 2, these registers become q-stage shift registers with each stage consisting of n/q elements.
The second advantage is that it provides higher flexibility to the FCM circuit. It is especially helpful when the input vector dimension n is large. In this case, basic design with q = 1 is suited only for applications requiring fast speed computation. However, because of the large area costs, it is difficult to implement the circuit in small FPGA devices. This difficulty may be solved by the realization of FCM with larger q values, which usually requires significantly lower consumption of hardware resources.

Recursive LMS Unit
The architecture of recursive LMS unit is shown in Figure 8, which contains kernel Gaussian Computation unit, memory unit and matrix computation unit, and control unit.

Kernel Gaussian Computation Unit
The goal of kernel Gaussian computation unit is to compute φ i (x) given in Equation (1). Given x k and the centers v 1 , ..., v c , the kernel Gaussian computation unit calculates the φ 1 (x k ), ..., φ c (x k ), sequentially to produce the vector a k . Figure 9 shows the architecture of the kernel Gaussian computation unit. In addition to adders and multipliers, the architecture contains circuit for computing exponential function. This circuit is implemented by Altera Floating Point Exponent (ALTFP EXP) Megafunction [20]. Similar to the FCM circuit, the number of adders and multipliers in this unit grows with input vector dimension n. When n is large, the area costs for implementing the unit will be high. The shift register based approach employed in the FCM circuit can also be used here for reducing the area complexities. In this approach, each of the registers holding x k and v i is a q-stage shift register. The number of adders and multipliers become n/q. The hardware resource consumption can then be lowered.

Memory Unit
The memory unit is used to hold values required for the computation of recursive LMS algorithm shown in Equations (18) and (19). As depicted in Figure 10, there are 8 buffers (Buffers Y, W, P, G, S, H, T, and A) in the memory unit. When x k is the current training vector, the Buffer A stores a k obtained from kernel Gaussian computation unit. The Buffer Y contains the y k . The Buffers P and W consists of P k−1 and w k−1 , which are the computation results for the previous training vector x k−1 . Based on a k , y k , P k−1 and w k−1 , the matrix computation unit is then activated for the computation of P k and w k . The intermediate results during the computation are stored in the Buffers G, S, H and T. The P k and w k are then stored in Buffers P and W for the subsequent operations for the next training vector x k+1 . These buffers can operate as parallel-in parallel-out (PIPO), parallel-in serial-out (PISO), serial-in parallelout (SIPO), and/or serial-in serial-out (SISO) registers. The attributes of these buffers are summarized in Table 1.

Matrix Computation Unit
The matrix computation unit contains N 2-input multipliers, N 2-input adders, one N-input adder, and one inverse operator, as shown in Figure 11. The matrix computation unit therefore is able to perform c parallel multiplications and additions. The circuit operates in four modes, as shown in Figure 12

Control Unit
The control unit of the recursive LMS unit coordinates the operations of the kernel Gaussian computation unit, memory unit and the matrix computation unit. Figure 13 shows the state diagram of the control unit. As shown in Figure 13, the control unit operates in 13 states. State 0 reads x k from external bus, and computes a k using the kernel Gaussian Computation unit. The operations from State 1 to State 7 is to compute P k based on Equation (18). The operations from State 8 to State 12 then finds w k based on Equation (19). All the operations from State 1 to State 12 involve the Memory unit and Matrix Computation unit. For the operation of each state, the Memory unit provides the source data. The Matrix computation unit processes the source data. The computation results are then stored back to the Memory unit.
Note that each state may not be able to complete its operations in a single step. Because the Matrix Multiplication unit is able to perform up to c multiplications or additions at a time, when a state requires more than c multiplications or additions, multiple-step operations are required. Figure 14 shows the multiple-step operations of State 1, which compute P k−1 a k . Because there are c 2 multiplications in State 1, we need c steps to complete the operation, as revealed in the figure. Figures 15 and 16 show the multiple-step operations of States 2 and 3, respectively. For sake of brevity, Table 2 summarizes the operations of each state. The summary consists of the source and destination buffers provided by the Memory unit, the operation mode of the Matrix Computation unit, and the number of steps required for each state.

The Proposed Architecture for Online RBF Training and Classification
Suppose there are b classes to be classified. A direct approach to use the proposed architecture for RBF training is to train the b classes in one shot. However, this may require large number of training vectors t and large number of nodes c in the hidden layer to achieve high classification success rate. As a result, the hardware costs of the proposed architecture may be high. An effective alternative is to train one class at a time. That is, after the training, each class has its own centers v 1 , ..., v c and weights w 1 , ..., w c for RBF classification. In addition, because each training is for a single class only, the corresponding training vectors x 1 , ..., x t belong to the same class. Therefore, their desired RBF output values y 1 , ..., y t are identical. For sake of simplicity, let y = y 1 = ... = y t be the values of the desired output. During recursive LMS training, the buffer Y in the memory unit for storing desired RBF outputs should only need to be initialized as y before the training of each cluster. It is not necessary to update Buffer Y for each new input x k during the training process.
The training process can further be simplified by allowing the desired output y to be identical for the training of all the clusters. In this way, the buffer Y should only be initialized before the training of the first cluster. Its value will then be reused for the training of subsequent clusters.
This simplification is also beneficial for RBF classification after the training. It is not necessary to store the desired output for individual clusters because they share the same one (i.e., y). Given an input vector x for RBF classification, letŷ be the output of the RBF network for class i, and let E i = (ŷ − y) 2 be the squared distance between the desired output and the actual output. The vector x will be assigned to class i * , when The classification circuit for each class mainly contains the kernel computation unit shown in Figure 9, and c multipliers for the computation ofŷ based on Equation (2). It can be effectively implemented in a pipelined fashion. Replicated copies of the circuit with one for each class (i.e., b copies) can operate in parallel to further enhance the throughput of classification. The proposed architecture can be employed in conjunction with the softcore processor for on-chip learning and classification. As depicted in Figure 17, the proposed architecture is used as a custom user logic in a system-on-programmable-chip (SOPC) consisting of softcore NIOS II processor, DMA controller, ethernet MAC and SDRAM controller for controlling off-chip SDRAM memory. The NIOS II processor is used for coordinating all components in the SOPC. It receives training/test vectors from ethernet MAC and stores these vectors in the SDRAM. It is also able to deliver the training and classification results to external hosts via the ethernet MAC. In addition, the softcore processor is responsible for the initialization of the proposed architecture and DMA controller. The initialization of the proposed architecture involves the loading of the initial parameters to the FCM and recursive LMS circuits. These parameters include the number of centers c, the initial centers v i , i = 1, ..., c, for FCM circuit, and σ 2 , y, P 0 and w 0 for recursive LMS circuit. The parameters are all stored in registers and can be accessed by softcore processor. Allowing these parameters to be pre-loaded by softcore processors may enhance the flexibility of the SOPC system.
The proposed architecture is only responsible for RBF training. The input vectors for the RBF training are delivered by the DMA controller. In the SOPC system, the training vectors are stored in the SDRAM. Therefore, the DMA controller delivers training vectors from the SDRAM to the proposed architecture. After the RBF training is completed, the NIOS II processor then retrieves the resulting neurons from the proposed architecture. All operations are performed on a single FPGA chip. The on-chip learning is well-suited for applications requiring both high portability and fast computation.

Experimental Results
This section presents experimental results of the proposed architecture. We first compare the proposed RBF network with existing classification techniques. The comparisons are based on datasets from the UCI database repository [21]. There are 4 datasets considered in the experiment: Balance-Scale, BCW-Integer, Iris and Wine. These datasets provide useful examples for the classification of balance scale states, breast cancer diagnosis, iris plant recognition, and wine recognition. The datasets have different sizes, number of attributes, and number of classes. The description of the datasets is shown in Table 3. Table 3. Description of datasets.

Balance-Scale BCW-Integer Iris Wine
Sizes ( The classification success rate (CSR) is used to measure the performance of classification techniques. The CSR is defined as the number of input patterns that are successfully classified divided by the total number of input patterns. From Table 4, it can be observed that the proposed RBF network has highest CSRs for the datasets Iris and Wine. In addition, it has CSRs comparable to those of the best classifiers for the datasets Balance-Scale and BCW-Integer. The RBF network has superior performance because the centers and the connecting weights of the network can be effectively found by FCM and recursive LMS, respectively.
Next we compare FCM with the algorithm in [11] for selection of centers in RBF design for texture classification. The textures considered in the experiments are shown in Figure 18. The textures are labelled T1, T2, T3, T4 and T5 in the figure, respectively. The dimension of input vectors is n = 4 × 4. The comparisons are based on CSRs for 2-, 3-and 4-class texture classification (i.e., b = 2, 3 and 4). To achieve meaningful comparisons, all RBF networks are based on recursive LMS algorithms for the training of connecting weights. They only have different center selection algorithms. Table 5 shows the results of the comparison. For each texture classification experiment, the table reveals the largest CSR for each center selection algorithm. Because different number of centers may result in the same CSR, the lowest number of centers (i.e., c) yielding the CSR is also shown in the table. The RBF network with lowest c has the smallest hidden layer size, which is beneficial for subsequent training of connecting weights at the output layer. It can be observed from Table 5 that both center selection algorithms produce the same minimum number of centers for each experiment. For the experiment with b = 2, both methods also attain 100% CSR. Nevertheless, when b = 4, the FCM has superior CSR. Therefore, FCM is an effective alternative for RBF design.   We then evaluate the area complexities and latency of the proposed architecture. The area complexities are separated into four categories: the number of adders, the number of multipliers, the number of dividers and the number of registers. The latency is the training time. Table 6 shows the area complexities and latency of the proposed architecture. It can be observed from Table 6 that the area complexities of FCM mainly grows linearly with the vector dimension n. In addition, when the q-stage (q > 1) shift register is used, the area costs can be reduced. On the other hand, the area complexities of the recursive LMS increases with both the vector dimension n and the number of neurons in the hidden layer c. Moreover, the area costs grow inversely with q. The training time of FCM and recursive LMS increase with c and q. It also grows linearly with the number of training vectors t.  To further evaluate the area complexities, the physical implementation of the proposed architecture is considered. The design platform is Altera Quartus II [27] with SOPC Builder and NIOS II IDE. The target FPGA device for the hardware design is Altera Cyclone III EP3C120. Table 7 show the hardware resource consumption of the proposed architecture for vector dimensions n = 4 × 4 and the number of neurons c = 8 in the hidden layer, respectively. The FCM circuit is the basic circuit with q = 1. The hardware resource utilization of the entire SOPC systems is also revealed in Table 7 for comparison purpose. Three different area resources are considered in the tables: Logic Elements (LEs), embedded memory bits, and embedded multipliers. The LEs are used for the implementation of adders, dividers, and registers in the proposed architecture. Both the LEs and embedded memory bits are also used for the implementation of NIOS CPU of the SOPC system. The embedded multipliers are used for the implementation of the multipliers of the proposed architecture. It can be observed from the table that the entire SOPC consumes 84% of the LEs of the target FPGA device. Because the area costs grow with n, the extension of the circuit with q = 1 to larger n may be difficult.  Table 8 shows the effectiveness of using shift register based approach for RBF hardware design. The input vector dimension is extended from n = 4 × 4 to n = 8 × 8. That is, the vector dimension is increased 4 folds. We set q = 4 to reduce the number of adders, multipliers and dividers for the new vector dimension. In fact, from Table 8, we can see that the SOPC can still be implemented in the target FPGA even with n = 8×8. Note that, as compared with LE consumption in Table 7, the LE consumption for n = 8 × 8 and q = 4 only slightly increases from 84% to 90%. Without the employment of shift registers for vector operations, the LE consumption will exceed the capacity limit of the target FPGA device for n = 8 × 8.  Table 9 compares the area costs of proposed FCM with those of the FCM architecture presented in [28] for various c values with q = 1. The dimension of input vectors is n = 2 × 2. Because the architecture presented in [28] uses broadcasting scheme for membership coefficients and center computation, we can see from Table 9 that the proposed architecture consumes significantly less hardware resources. As the number of neurons reaches 32, the architecture presented in [28] consumes 114117 LEs, which is 97% of the LE capacity of the target FPGA device. In contrast, when c = 128, the proposed architecture only consumes 92295 LEs, which is 78% of the LE capacity of the target FPGA device.  Table 10. The measurement of speed is the throughput, which is defined as the number of classifications per second. The hardware circuits are implemented in different FPGA devices. Therefore, it may be difficult to have direct comparisons on the hardware resource consumption. Our comparisons are based on the facts that logic cells (LCs) are the major hardware resources available in the FPGA devices for [29,30]. The LCs and LEs have similar structures, which contain a look-up table (LUT) and a 1-bit register. Therefore, the comparison of hardware resource consumption is based on the number of LEs or LCs used by the circuits. The CSR of these circuits are measured from the same dataset Iris obtained from UCI repository. There are 4 attributes for each instance. Therefore, all the circuits have the same input vector dimension n = 4. Two RBF networks with different number of centers (c = 2 and c = 4) are considered in this experiment. From Table 10, we can see that the proposed architectures with c = 2 and c = 4 have the highest throughput and CSR, respectively. They outperform the architecture in [29] in throughput, LE/LC consumption and CSR. They also have higher speed as compared with the architecture in [30]. The proposed RBF with c = 2 has slightly lower CSR as compared with the RBF with c = 4. However, it has higher throughput and lower LE consumption. Therefore, when both the speed and hardware consumption are important concerns, the RBF with c = 2 is more effective. Alternatively, when high speed computation and high CSR are desired, the RBF with c = 4 may be a better selection.
The proposed RBF architecture also features fast training time for texture classification. To illustrate this fact, Table 11 compares various hardware and software implementations for the training. The number of textures considered in this experiment is b = 4. The textures T1, T3, T4 and T5 shown in Figure 18 are used as the training images for the experiments. The CPU time of the proposed RBF architecture and the architecture in [31] are measured by the NIOS II 50 MHz softcore CPU in the SOPC platform. The architecture [31] is based on the generalized Hebbian algorithm (GHA) [2]. The number of principal components in the GHA is 7. Both the implementation of RBF and GHA architectures are based on the same FPGA device (i.e., Cyclone III). It can be observed from Table 11 that the proposed architecture achieves comparable CSR to that of GHA architecture with significantly lower computation time. In fact, the training time of the proposed architecture is only 4.32% of that of its software counterpart (126.68 ms vs. 2927.38 ms). As compared with the GHA algorithm based on incremental updating/training processes, its training time is only 2.42% of that of GHA architecture (126.68 ms vs. 5240.92 ms). All these facts demonstrate the effectiveness of the proposed architecture.

Concluding Remarks
A novel RBF training circuit capable of online FCM training and recursive LMS operations has been realized. Its FCM implementation consumes less hardware resources as compared with existing FCM designs. The hardware recursive LMS can also expedite the training at the output layer. Experimental results reveal that the proposed RBF network has superior CSR over existing classifiers for a bundle of datasets in the UCI repository. In addition, the proposed RBF architecture has superior speed performance over its software counterparts and other architectures for texture classification. In fact, the RBF network is able to attain CSR of 98% for Iris plant classification. Moreover, it has CSR of 96% for the 4-class texture classification. The training time of the RBF architecture is only 126.68 ms. By contrast, the training time of its software counterpart is 2927.38 ms. In addition, for the low cost FPGA devices such as Altera Cyclone III, only 84%, 60% and 39% of logic elements, embedded multipliers, and embedded memory bits are consumed for dimension n = 4 × 4. The proposed architecture therefore is an effective alternative for on-chip learning applications requiring low area costs, high CSR, and high speed computation.