Efficient Architecture for Spike Sorting in Reconfigurable Hardware

This paper presents a novel hardware architecture for fast spike sorting. The architecture is able to perform both the feature extraction and clustering in hardware. The generalized Hebbian algorithm (GHA) and fuzzy C-means (FCM) algorithm are used for feature extraction and clustering, respectively. The employment of GHA allows efficient computation of principal components for subsequent clustering operations. The FCM is able to achieve near optimal clustering for spike sorting. Its performance is insensitive to the selection of initial cluster centers. The hardware implementations of GHA and FCM feature low area costs and high throughput. In the GHA architecture, the computation of different weight vectors share the same circuit for lowering the area costs. Moreover, in the FCM hardware implementation, the usual iterative operations for updating the membership matrix and cluster centroid are merged into one single updating process to evade the large storage requirement. To show the effectiveness of the circuit, the proposed architecture is physically implemented by field programmable gate array (FPGA). It is embedded in a System-on-Chip (SOC) platform for performance measurement. Experimental results show that the proposed architecture is an efficient spike sorting design for attaining high classification correct rate and high speed computation.


Introduction
A spike train is a sequence of action potentials generated by a neuron. Extracellular neural recording often results in a mixture of these trains from neurons near the recording electrodes. Spike sorting is the process of segregating the spike trains of individual neurons from this mixture. Spike sorting is a difficult task, due to the presence of background noise and the interferences among neurons in a local area. A typical spike sorting algorithm involves computationally demanding operations, such as feature extraction and clustering [1]. For applications, such as brain machine interface (BMI) [2], involving the control of artificial limb movements, spike sorting systems need to have the ability to process raw spike trains in real time, because the typical delay between neural activity and human limb movement is only several hundred milliseconds [3]. One way to expedite the spike sorting computation is to implement the algorithms in hardware.
A common approach for hardware design is based on application-specific integrated circuits (ASICs). A major drawback of ASICs is the lack of flexibility for changes. With the wide range of spike sorting algorithms that already exist and the continual design and improvement of algorithms, the ability to easily change a spike sorting system for new algorithms is usually desired. However, the modifications in ASIC are very difficult, especially when chips are implanted in the brain. The field programmable gate array (FPGA) [4] is an effective alternative to ASIC for hardware implementation with lower NREcosts. Moreover, the circuits in an FPGA are reconfigurable, thereby providing higher flexibility to a spike sorting architecture for future extensions.
The goal of this paper is to present an FPGA-based spike sorting hardware architecture for real-time computation. The architecture is able to perform both feature extraction and clustering in hardware, which are the most computationally demanding tasks for spike sorting. The feature extraction is based on the generalized Hebbian algorithm (GHA) [5,6], which is an incremental principal component analysis (PCA) algorithm capable of extracting PCA features without the need of covariance matrix of input data. Therefore, the GHA is more suitable for hardware implementation, because there is no computation and storage overhead for processing covariance matrix. The resulting features are then clustered using the fuzzy C-means (FCM) algorithm [7,8]. As compared with other unsupervised clustering algorithms, such as K-means, the FCM algorithm has the advantage that its performance is less sensitive to the selection of initial centers. Therefore, it is less likely for the FCM-based feature clustering to fall into a poor local optimum.
A challenging issue of combining both GHA and FCM circuits in one chip is the high area costs. To reduce the hardware resource utilization, a spike is separated into a number of smaller blocks in the GHA circuit. Different blocks share the same circuit for PCA feature extraction. In addition, the FCM circuit pre-computes the common factors of different membership coefficients, so that the number of adders and multipliers for membership coefficients computation can be reduced. The requirement for storing entire membership coefficients matrix for center computation can also be evaded by adopting an incremental center computation scheme in the FCM circuit.
To physically evaluate the proposed architecture, a spike sorting system on an FPGA-based System-on-Chip (SOC) platform is implemented, where the proposed spike sorting architecture is used as a hardware accelerator. The softcore processor in the SOC platform is adopted for coordinating different components in the SOC. It is also used to measure the computation time of the proposed architecture. Experimental results reveal that the proposed architecture is an effective alternative for real-time spike sorting with accurate feature extraction and clustering.

Related Works
A software implementation of PCA and FCM for automatic and online spike sorting is implemented in [9]. It uses a partial single value decomposition (PSVD) preprocessing technique for enhancing the robustness and speed of PCA. A spike sorting system based on SOC is proposed in [10]. The system is implemented on the Smartdust platform, in which PCA-based feature extraction and K-means-based clustering are realized by software. For these software approaches, real-time spike sorting operations may be difficult when the processor operates at low clock rates. A number of FPGA-based hardware architectures have been proposed to expedite the spike sorting. The architectures in [11,12] are able to perform online PCA/GHA feature extraction. The architectures for discrete wavelet transform (DWT)-based feature extraction are proposed in [13,14]. The circuit for the extraction of zero-crossing features (ZCF) of spike trains is presented in [15]. The architecture in [16] is capable of performing self-organizing map (SOM)-based clustering. A common drawback of these architectures is that they are not able to perform both feature extraction and clustering. Because all of these operations are required for spike sorting, hardware implementation of any single operation may not be able to provide sufficient throughput at the front end.
Studies in [17][18][19] have proposed GHA architectures for texture classification and face recognition. Although these architecture may be directly used for spike sorting, some architectures are not suited, because of their high area costs and/or long latency. The architecture presented in [17] provides high throughput for GHA training, since it processes all elements of input vectors concurrently. However, the area cost of the architecture grows linearly with the dimension of input vectors. On the contrary, only one element of input vectors is delivered to architecture proposed in [18] at a time. The architecture therefore has low area cost. Nevertheless, its latency grows linearly with the vector dimension. Hence, the architecture may not be suitable for the training of long spikes. The architecture in [19] separates input vectors into a number of smaller blocks. It then processes one block at a time. The architecture has both the advantages of low area costs and high computational speed [19].
In addition to GHA architectures, many FCM architectures [20][21][22] have been proposed for image processing. However, some of these architecture are difficult to be extended for spike sorting. The architecture in [20] is designed for clustering with only two classes. For spike sorting applications, the number of classes may be larger than two. The architecture in [21] allows all the membership coefficients associated with a training vector to be computed in parallel. The architecture has both high throughput and high area costs. The high hardware utilization increases the difficulty of integrating FCM with GHA on a single chip. An effective alternative to [20,21] is based on [22], which produces membership coefficients sequentially to reduce the area costs. In addition, the common factors of different membership coefficients are pre-computed by a shared circuit to further lower the hardware resource utilization. This pre-computation step is also able to accelerate the speed of membership coefficient computation.

Preliminaries
This section briefly review some basic facts of GHA and FCM algorithms and their applications to spike sorting.

GHA
Let: be the n-th input and output vectors to the GHA, respectively. In addition, m, p and t are the vector dimension, the number of principal components (PCs) and the number of input and output vectors for the GHA, respectively. The output vector, y(n), is related to the input vector, x(n), by: where the w ji (n) stands for the weight from the i-th synapse to the j-th neuron at iteration n. Let: be the j-th synaptic weight vector. Each synaptic weight vector, w j (n), is adapted by the Hebbian learning rule: where η denotes the learning rate. After a great deal of iterative computation and adaptation, w j (n) will asymptotically approach the eigenvector associated with the j-th eigenvalue, λ j , of the covariance matrix of input vectors, where λ 1 > λ 2 > · · · > λ p . To reduce the complexity of computing implementation, Equation (5) can be rewritten as: A more detailed discussion of GHA can be found in [5,6].

FCM
.., f t } be a set of training vectors for clustering, where t is the number of training vectors. The FCM computes v i , i = 1, ..., c, to separate F into c clusters. The v i is the center of cluster i. The FCM involves minimization of the following cost function: where u i,n is the membership of x k in class i. The cost function, J, is minimized by a two-step iteration in the FCM. In the first step, the centers, v 1 , ..., v c , are fixed, and the optimal membership matrix, {u i,n , i = 1, ..., c, n = 1, ..., t}, is computed by: After the first step, the membership matrix is then fixed, and the new center, v i , is obtained by: The iteration continues until the convergence of J.

Applications of GHA and FCM to Spike Sorting
The GHA and FCM can be used for feature extraction and clustering of spikes, respectively. To use GHA for feature extraction, the x(n) in Equation (2) is the n-th spike in the spike train. Therefore, the vector dimension, m, is the number of samples in a spike. Let: be the synaptic weight vectors of the GHA after the training process has completed. Based on w j , j = 1, ..., p, the GHA feature vector extracted from training vector, x(n) (denoted by f n ), is computed by: where: is the j-th element of f n . The set of feature vectors F = {f 1 , ..., f t } are then used as the training set for FCM. After the FCM training shown in Section 3.2 is completed, the resulting cluster centers, v 1 , ..., v c , are then used to cluster the spikes. The clustering is based on the following simple rule that spike x(n) is assigned to cluster i if: where d(f n , v j ) is the squared distance between f n and v j . Figure 1 shows the proposed architecture for spike sorting, which contains the GHA unit, FCM unit and global controller. Based on spikes x(n), n = 1, ..., t, the GHA unit is used for computing feature vectors f n , n = 1, ..., t. The FCM unit then computes centers v j , j = 1, ...c, for clustering. The global controller coordinates all the components in the architecture.

GHA Unit
The architecture of the GHA unit is an extension of the architecture presented in [19]. The GHA unit consists of three functional units: the memory unit, the synaptic weight updating (SWU) unit and the principal components computing (PCC) unit.

SWU Unit of GHA
The design of the SWU unit is based on Equation (6). Although the direct implementation of Equation (6) is possible, it will consume large hardware resources [19]. One way to reduce the resource consumption is by observing that Equation (6) can be rewritten as: where: and z j (n) = [z j1 (n), . . . , z jm (n)] T . The z ji (n) can be obtained from z (j−1)i (n) by: When j = 1, from Equations (15) and (16), it follows that: Therefore, the hardware implementation of Equations (14) and (16) is equivalent to that of Equation (6). Figure 2 depicts the hardware implementation of Equations (14) and (16). As shown in the figure, the SWU unit produces one synaptic weight vector at a time. The computation of w j (n + 1), the j-th weight vector at the iteration n + 1, requires the z j−1 (n), y(n) and w j (n) as inputs. In addition to w j (n + 1), the SWU unit also produces z j (n), which will then be used for the computation of w j+1 (n + 1). Hardware resource consumption can then be effectively reduced.  (14) and (16). One way to implement the SWU unit is to produce w j (n + 1) and z j (n) in one shot. However, m identical modules, individually shown in Figure 3, may be required, because the dimension of vectors is m. The area costs of the SWU unit then grow linearly with m. To further reduce the area costs, each of the output vectors, w j (n+1) and z j (n), is separated into b blocks, where each block contains q elements. The SWU unit only computes one block of w j (n + 1) and z j (n) at a time. Therefore, it will take b clock cycles to produce complete w j (n + 1) and z j (n).

PCC Unit of GHA
The PCC operations are based on Equation (3). Therefore, the PCC unit of the proposed architecture contains adders and multipliers. Because the number of multipliers grows with the vector dimension, m, the direct implementation using Equation (3) may consume large hardware resources when m becomes large. Similar to the SWU unit, the block-based computation is used for reducing the area costs. In fact, Equation (3) can be rewritten as: The implementation of Equation (18) needs only q multipliers, a q-input adder and an accumulator.

Memory Unit of GHA
The memory unit contains four buffers: Buffers A, B, C and D. Buffer A fetches and stores spike x(n) from the main memory. Buffer B contains z j (n) for the computation in PCC and SWU units. Buffer C consists of the synaptic weight vectors, w j (n). The feature vectors, f 1 , ..., f t , are stored in Buffer D. The Buffers A, B and C are shift registers. Buffer D is a two-port RAMfor the subsequent access by the FCM unit.

Operations of the GHA Unit
In typical spike sorting implementations [10,23], a spike may contain 64 samples. In addition, two PCs may suffice for feature extraction [1]. Therefore, without loss of generality, the GHA unit for m = 64 (i.e., the vector dimension is 64) and p = 2 (i.e., number of PCs is two) is considered in this subsection. In the GHA unit, each vector is separated into two blocks. Moreover, the dimension of each block is 32. Therefore, we set b = 2 and q = 32 for the circuit implementation. Figure 4 shows the resulting GHA circuit for m = 64, p = 2, b = 2 and q = 32. The operations of the GHA circuit can be separated into four states, as revealed in Figure 5. The most important operations of the GHA circuit are the PCC operations in State 3 and SWU operations in State 4. These two operations are further elaborated below.
Assume the input vector, x(n), is available in Buffer B. In addition, the current synaptic weight vectors, w 1 (n), w 2 (n), are stored in the Buffer C. Based on x(n) and w 1 (n), w 2 (n) the PCC unit produces output vector y 1 (n), y 2 (n). Figure 6 reveals the operations of the PCC unit for computing y 1 (n), y 2 (n). The computation of y j (n) is separated into two steps. The first step finds and, then, accumulates the result with that of the previous step to find y j (n). These two steps share the same circuit. It can also be observed from Figure 6 that the data obtained from Buffers B and C for PCC operations are also rotated back to Buffers B and C for subsequent operations in the SWU unit. Upon the completion of PCC operations, the SWU unit will be activated in State 4. Figure 7 shows the operations of the SWU unit. Using x(n), y j (n) and w j (n), j = 1, 2, the SWU unit computes the new synaptic weight vectors, w j (n + 1), j = 1, 2, which are then stored back to Buffer C for subsequent training. Similar to the computation of y j (n), the computation of w j (n + 1) consists of two steps. The first step computes the first half of w j (n+1) (i.e., w j,1 (n+1),...,w j,32 (n+1)). The second step calculates the second half. These two steps also share the same circuit. Moreover, the computation of w 1 (n + 1) also produces z 1 (n), which is stored back to Buffer B. As revealed in Figure 7c,d, z 1 (n) is then retrieved from Buffer B for the computation of w 2 (n + 1).  (c) (d) Figure 7. Operations of SWU unit in State 4 of the GHA circuit: (a) first step of computing w 1 (n + 1); (b) second step of computing w 1 (n + 1); (c) first step of computing w 2 (n + 1); (d) second step of computing w 2 (n + 1).
After the training process of the GHA circuit is completed, Buffer C contains the synaptic weight vectors, w 1 and w 2 . Based on the synaptic weight vectors, the feature extraction operations can then proceed, which are the same as those shown in Figure 6 for PCC operations. The computation results are stored in Buffer D as feature vectors f n , n = 1, ..., t, for the subsequent FCM clustering operations.

FCM Unit
The architecture of the FCM unit contains six sub-units: the pre-computation unit, the membership coefficients updating unit, centroid updating unit, cost function computation unit, FCM memory unit and control unit. The operations of each sub-unit are stated below.

Pre-Computation Unit of FCM
The pre-computation unit is used for reducing the computational complexity of the membership coefficients calculation. Observe that u i,n in Equation (8) can be rewritten as: where: Given f n and centers v 1 , ..., v c , membership coefficients u 1,n , ..., u c,n have the same P n . Therefore, the complexity for computing membership coefficients can be reduced by calculating P n in the pre-computation unit. Based on Equation (20), the architecture for computing P n can be divided into two stages. The first stage evaluates ||f n − v j || 2 . The second stage first finds the inverse of ||f n − v j || 2 and, then, accumulate this value with

Membership Updating Unit of FCM
Based on Equation (19), the membership updating unit uses the computation results of the precomputation unit for calculating the membership coefficients. Given f n , the membership coefficients computation unit computes u 2 i,n for i = 1, ..., c, one at a time. The circuit can be separated to two stages. The first stage is used for computing ||f n − v i || 2 P n . The membership coefficient, u 2 i,n , is then obtained by a cascade of inverse and square operations.

Center Updating Unit of FCM
The center updating unit incrementally computes the center of each cluster. The major advantage for the incremental computation is that it is not necessary to store the entire membership coefficients matrix for the center computation. Define the incremental center for the i-th cluster up to f n as: when n = t, v i (n) then is identical to the actual center v i given in Equation (9).
The center update unit operates in accordance with Equation (21). It contains a multiplier, a cell array and a divider. There are two groups of cells in the cell array. The cell i in the first group contains the accumulated sum ∑ n−1 k=1 f k u 2 i,k . Moreover, the cell i in the second group contains the accumulated sum ∑ n−1 k=1 u 2 i,k . Therefore, each cell in the array is actually an accumulator. The outputs of the array are used for computing v i (n) using a divider.

Cost Function Computation Unit of FCM
Similar to the centroid updating unit, the cost function unit incrementally computes the cost function, J. Define the incremental cost function, J(i, k), as: The circuit receives u 2 i,n and ||f n − v i || 2 from the membership coefficient updating unit. The product u 2 i,n ||f n − v i || 2 is then accumulated for computing J(i, n) in Equation (22). When i = c and n = t, J(i, n) then is identical to the actual cost function, J, given in Equation (7). Therefore, the output of the circuit becomes J as the cost function computations for all the training vectors are completed.

FCM Memory Unit of FCM
This unit is used for storing the centers for FCM clustering. There are two memory banks (Memory Bank 1 and Memory Bank 2) in the on-chip centroid memory unit. Memory Bank 1 stores the current centers, v 1 , ..., v c . Memory Bank 2 contains the new centers, v 1 , ..., v c , obtained from the center updating unit. Only the centers stored in the Memory Bank 1 are delivered to the pre-computation unit and membership updating unit for the membership coefficient computation. The updated centroids obtained from the centroid updating unit are stored in Memory Bank 2. Note that the centroids in Memory Bank 2 will not replace the centroids in Memory Bank 1 until all the input feature vectors, f n , n = 1, ..., t, are processed.

FCM Operations
For sake of simplicity, Figure 8 shows the FCM circuit for c = 2. The circuit therefore is able to separate the set of feature vectors into two classes. The extension of the circuit for larger c can be easily accomplished by enlarging the size of the cell array in the center update unit and the memory banks in the memory unit.
The operations of the circuit for c = 2 can be separated into four states. Table 1 reveals the operations of each state, which are described in terms of the signals produced at various nodes in the circuit. The locations of the nodes considered in Table 1 In the FCM circuit, each feature vector, f n , is retrieved from the GHA circuit. It stays in Node B of the FCM circuit for all four states. Each center, v i , is obtained from memory unit of the FCM circuit. It appears in Node A of the FCM circuit. Given the feature vector, f 1 , it can be observed from Table 1 that the goal of States 1 and 2 is to compute P 1 (shown in Node C) using the precomputation unit. After that, States 3 and 4 compute the membership coefficients, u 2 1,1 and u 2 2,1 , using the membership coefficients updating unit, respectively. The membership coefficients are produced in Node D in Figure 8. In addition, based on u 2 1,1 obtained from State 3, the incremental center, v 1 (1), and incremental cost, J(1, 1), are computed by the center updating unit and cost function computation unit in State 4, respectively. The v 1 (1) and J(1, 1) are produced at Nodes G and H, respectively. Based on u 2 2,1 obtained from State 4, the v 2 (1) and J(2, 1) are then computed in State 1 associated with f 2 , respectively. The P 2 is also computed in this state. The States 2-4 associated with f 2 then operate in the same manner as those associated with f 1 . Figure 9 shows the corresponding state diagram in the controller of the FCM circuit.

Cluster Validity Index Computation
The proposed circuit can be shared by FCM operations with different numbers of clusters (i.e., different c's). This can easily be accomplished by designing the cell array in the center updating unit and the memory bank in the memory unit in accordance with the largest c. It is not necessary to modify other parts of the FCM circuit. An advantage of allowing different c's to share the same circuit is that it helps to perform automatic detection of cluster numbers when prior knowledge of the number of clusters is not available.
The proposed method for automatic detection of number of clusters is based on a cluster validity index [24,25]. In this approach, FCM-based clusterings with different numbers of clusters are executed independently. The value of the cluster validity index is computed after each clustering. The selected number of clusters for spike sorting is the number optimizing the cluster validity index.
A number of cluster validity indices have been proposed for FCM. The index used by the proposed circuit is based on a modified partition coefficient. For a given c, the partition coefficient, denoted as I(c), is defined as: The c maximizing I(c) is then selected as the cluster number for spike sorting. When I(c) is biased for smaller c values (or larger c values), a simple compensation function, c×δ, can be included in I(c), where δ is a constant. In the proposed circuit, the cell array in the centroid computation unit accumulates u 2 i,n . Therefore, the proposed circuit can be used for the computation of I(c) with only minor modifications. Other cluster validity indices [25] can also be used for the proposed FCM circuit. A dedicated unit for the computation of these indices may be required.

Interaction Between GHA and FCM Circuits
The global controller of the proposed spike sorting circuit shown in Figure 1 is responsible for the interaction between GHA and FCM circuits. Figure 10 shows the state diagram of the global controller. From Figure 10, it follows that the GHA unit will operate first for the feature extraction. The spikes, x(n), n = 1, ..., t, are delivered to the GHA circuit one at a time for the training of synaptic weight vectors, w 1 , .., p. The delivery of the same set of spikes, x(n), n = 1, ..., t, will not be halted until the GHA training is completed. Upon the completion of GHA training, the spike set, x(n), n = 1, ..., t, is delivered to the GHA circuit again for the computation of feature vectors f n , n = 1, ..., t. The FCM circuit will not be activated until all the feature vectors, f n , n = 1, ..., t, are available in Buffer D of the GHA unit. After that, the feature vectors, f n , n = 1, ..., t, are delivered one at a time to the FCM unit from the GHA unit for the training of centers v 1 , ..., v c . The delivery of the same set of feature vectors, f n , n = 1, ..., t, will not be halted until the FCM training is completed. The resulting centers, v 1 , ..., v c , computed by the FCM unit can then be used for the classification of spikes.

SOC-Based Spike Sorting System
The proposed architecture is used as a custom user logic in an SOC system consisting of a softcore NIOSCPU [26], DMAcontroller, Ethernet MACcontroller and on-chip RAM, as depicted in Figure 11. In a typical spike sorting system, there are three operations: spike detection, feature extraction and clustering; as shown in Figure 12. The nonlinear energy operations (NEO) and thresholding can be used for the spike detection, which is implemented by software running on a NIOS softcore CPU. The operations supported by the proposed architecture are feature extraction and clustering. All the detected spikes are stored in the on-chip RAM and then transported to the proposed spike sorting circuit for both feature extraction and clustering. The DMA-based training data delivery is performed so that the memory access overhead can be minimized. The softcore NIOS CPU coordinates different components in the SOC. It is responsible for circuit activation and control. The results of feature extraction and clustering are stored in the memory unit of the GHA and FCM circuits for subsequent operations. Figure 11. The System-on-Chip (SOC)-based spike sorting system.

Performance Analysis and Experimental Results
In order to evaluate the performance of the proposed architecture for spike sorting, the simulator developed in [23] is adopted to generate extracellular recordings. The simulation gives access to ground truth about spiking activity in the recording and, thereby, facilitates a quantitative assessment of architecture performance, since the features of the spike trains are known a priori. Various sets of spikes with different signal-to-noise (SNR) ratios and interference levels have been created by the simulator for our experiments. All the spikes are recorded with a sampling rate of 24,000 samples/s. The length of each spike is 2.67 ms. Therefore, each spike has 64 samples. The dimension of vectors for GHA training, therefore, is m = 64. The number of PCs is p = 2 for the circuit design.
We first consider the classification correct rate (CCR) of the proposed architecture. The CCR for spike sorting is defined as the number of spikes that are correctly classified by the total number of spikes. To show the robustness of the proposed architecture against noise interference, various SNR ratios are considered, ranging from SNR = 1 dB to 10 dB. Table 2 shows the resulting CCRs for the spike trains with two target neurons (c = 2) and three target neurons (c = 3). The number of interfering neurons is two. The duration of the spike trains is 14 seconds. The spikes extracted from the spike trains are used for GHA and FCM training, as well as spike classification. The total number of spikes used for training and classification (t) and the number of spikes that are correctly classified (t) are also included in the table. Because the performance of FCM training may be dependent on the selection of initial vectors, each CCR value in the table is the average CCR values of 40 independent executions. From the table, it can be observed that the proposed architecture is able to attains CCR above 84 % for c = 3 when SNR = 1 dB. To further elaborate the effectiveness of the GHA and FCM algorithms for spike sorting, Figure 13 shows the distribution of GHA feature vectors of spikes for SNR = 4 and the results of FCM clustering. The center of each cluster produced by FCM is also marked in the figure. By comparing Figure 13a with Figure 13b, we see that the proposed GHA and FCM circuits are able to correctly separate spikes, even for large noises.
We also use the the distribution of GHA feature vectors in Figure 13a as an example to evaluate the effectiveness of the automatic detection of the number of clusters. The cluster validity index I(c) defined in Equation (23) is computed for c = 2, 3 and 4, based on the distribution. The computed values, I(2), I(3) and I(4), are 1,594, 1,12 and 1,467, respectively. Because I(3) has the largest, the estimated cluster number is c = 3, which is consistent with the ground truth.
In addition to SNR values, the number of interfering neurons may also influence the CCRs for spike sorting. Table 3 shows the CCRs of the GHA and FCM algorithms for different numbers of interfering neurons e. In the experiments, the number of target neurons c = 3. It can be observed from Table 3 that only a small degradation is introduced when the number of interfering neurons increases.
The performance of the proposed architecture with imperfect spike detectors is considered in the experiments shown in Table 4 for three target neurons and two interfering neurons. The training sets are obtained from NEO detectors. Therefore, the training sets contain both true positive spikes (i.e., actual spikes) and false positive spikes (i.e., although detected by the detectors, they are actually not spikes). The training results are then applied to the test sets (different from the training set) for CCR measurements. The false positive ratio (FPR) in Table 4 is defined as the number of false positive spikes divided by the total number of spikes in the training set. It can be observed from Table 4 that the CCRs are robust against the FPRs. When the FPRs are below 50 %, the proposed architecture is able to maintain CCRs above 80% for SNR at 1 dB.  In addition to having high CCR, the proposed spike sorting circuit features low area complexities. Table 5 shows the area complexities of the proposed circuit. Because adders, multipliers and registers are the basic building blocks of the GHA and FCM architectures, the area complexities considered in the table are separated into three categories: the number of adders, the number of multipliers and the number of registers. We can see from the table that the number of adders and multipliers grows linearly with the number of PCs, p, and number of elements in a block, q. They are not dependent on the vector dimension, m, the number of clusters, c, and the number of training vectors, t. Table 5. Area complexities of the proposed spike sorting circuit.

GHA FCM Total
Adders Because the spike sorting circuit contains buffers for storing spikes, synaptic weight vectors, and feature vectors, the number of registers increases linearly with p, m, c and t. Table 6 shows the hardware resource utilization for spike sorting applications with p = 2, m = 64, c = 3, and t = 800. The design platform for the experiments is the Altera Quartus II with Qsys [27]. The target FPGA device is an Altera Cyclone IV EP4CGX150DF31C7. Three different area resources are considered in the Table 6: logic elements (LEs), embedded memory bits and embedded multipliers. The LEs are used for the implementation of adders, multipliers and registers in the GHA and FCM architectures. The LEs, embedded memory bits and memory bits are used for the implementation of the NIOS CPU of the SOC system. The embedded multipliers are used for the implementation of the multipliers of the GHA and FCM architectures. There are 149,760 LEs, 720 embedded multipliers and 6,635,520 memory bits in the target FPGA device. It can be observed from Table 6 that only limited hardware resources are consumed by the proposed circuit. The NIOS CPU also utilizes hardware resources. However, the hardware costs of the SOC are only slightly larger than those of the spike sorting circuit, as shown in Table 6.
The proposed SOC-based spike sorting system features fast computation. Table 7 shows the training time of the proposed spike sorting system for various clock rates. The training time of its software counterpart running on the Intel I7 processor is also included in the table for comparison purposes. The training set for these experiments consists of 800 spikes. The number of epochs for GHA training is 100. The number of iterations for FCM training is 10. It can be observed from Table 7 that the propose SOC-based spike sorting system is able to operate up to a 1 GHz clock rate. In addition, both the GHA and FCM training times decrease linearly with the clock rate. When clock rate becomes 1 GHz, the total training time of the proposed SOC-based spike sorting system is only 1.99 ms. By contrast, the training time of the Intel I7 processor is 193.18 ms. The speedup of the proposed hardware system over its software counterpart is therefore 97.08. To further facilitate the spike sorting computation, the computation of cluster validity index is also implemented by hardware. The cluster validity index given in Equation (23) mainly contains the sum of membership coefficients. Because the FCM circuit is able to provide the sum of membership coefficients for each cluster without additional overhead, the major operations of the cluster validity index circuit are only to the collect cluster validity indices for different number of clusters. After all the indices are available, the circuit then finds the index with the largest value. As a result, the latency of the circuit for computing I(c) is only four clock cycles (i.e., 4 ns at 1 GHz). The circuit also consumes only 227 logic elements. No embedded multiplier and memory bit are required. The proposed cluster validity index is therefore beneficial for enhancing the ability to automatically select the right number of clusters at the expense of low latency and area overhead.
The proposed architecture can be directly used for offline data analysis, where the stored raw spike trains can be used as the inputs. The proposed architecture then performs fast GHA and FCM training for real-time spike sorting with training time 1.99 ms at a 1 GHz clock rate. For the software-based spike sorting systems, such as Wave Clus [28], the total computation time (measured from the Intel I7 2.61 GHz processor) for feature extraction and clustering is 464.35 ms for the same dataset. Direct comparisons between the proposed architecture and Wave Clus may be difficult, because they are based on different algorithms for spike sorting. However, the superiority in the computation speed of the proposed system can still be observed. From Table 7, we see that the proposed architecture has longer computation time, only when it operates at a slower clock rate of 1 MHz.
The power consumption of the proposed architecture implemented by FPGA at different clock rates is included in Table 8. The power consumption estimation is based on the tool, PowerPlay Power Analyzer, provided by Altera. It can be observed from Table 8 that the power consumption is 91.75 mW when the clock rate is 1 MHz. Therefore, the architecture at 1 MHz may be suited for online applications with limited power resources. The power consumption is 5.84 W when the architecture operates at 1 GHz. An advantage of the proposed architecture operating at a higher clock rate is that it is able to perform GHA and FCM training for a larger number of channels per unit time. To show this fact, the number of channels the proposed architecture can process per second for each clock rate is also included in Table 8. It can be observed from the table that the number of channels per second is 502.5 for a clock rate of 1 GHz. In this case, the proposed architecture is suited for real-time offline operations. We have also implemented the architecture by ASIC as implantable chips. The implementation is based on Sypnosys Design Compiler with TSMC90 nm technology library. The power estimation is based on PrimeTime provided by Sypnosys. The estimated power dissipation of the ASIC is only 440.34 µW at a 1 MHz clock rate, which is significantly less than the power consumption of its FPGA counterpart (i.e., 91.75 mW) at the same clock rate. The total area of the proposed architecture is 0.7432 mm 2 . The power density, therefore, is 59.25 mW/cm 2 . It is recommended that the power density should be below 80 mW/cm 2 [29], so that potential brain damage can be avoided. The ASIC version of the proposed spike sorting system has a power density below 80 mW/cm 2 ; therefore, it may be well suited as an implantable chip at the front end.
Next, we compare the proposed GHA and FCM circuits with existing implementations. Table 9 compares the average CCRs of the various algorithms adopted by [9,10,12] for different SNR ratios. All the algorithms are evaluated with the same set of parameters (i.e., c = 3, p = 2, m = 64). Given an SNR value, the CCRs for different algorithms are measured on the same noisy spike trains. To obtain the average CCR, the algorithms are executed 40 times for each SNR value, independently. It can be observed from Table 9 that the combination of GHA and FCM for spike sorting outperforms other algorithms. As compared with the K-Means algorithm, the FCM algorithm is less sensitive to the selection of initial centers for clustering. Therefore, the FCM-based spike sorting algorithms have higher CCRs, as revealed in the table. This is especially true when SNR values are high (e.g., 10 dB). In fact, when noise energy becomes lower, the feature vectors produced by GHA or PCA algorithms are located in smaller regions. As a result, it is more likely that some of the randomly selected initial centers are far away from these regions. The K-means algorithm is then likely to fall into a poor local optimum. Given the same algorithm for clustering (such as FCM), the GHA attains comparable performance with PCA. Therefore, the combination of GHA with FCM is an effective alternative for spike sorting.
In Table 10, we compare the area costs and throughput of the proposed GHA circuit with those of other FPGA-based hardware implementations [12,30] for feature extraction. The throughput is defined as the number of input training vectors the circuit can process per second. It can be observed from Table 10 that the proposed GHA architecture attains the highest clock rate and throughput. In fact, the proposed architecture has a throughput 28.125 (4.50 × 10 7 vs. 1.60 × 10 6 ) and 16.3 times ((4.50 × 10 7 vs. 2.75 × 10 6 ) higher than that of the architectures in [12,30], respectively. The proposed algorithm has superior performance, because it is based on shift registers for storing weight vectors and input vectors for high-speed computation. Finally, Table 11 reveals the area costs and throughput of various clustering implementations. The throughput is also defined as the number of input training vectors the FCM architecture can process per second. The proposed architecture consumes more hardware resources. However, the FCM architecture in [20] is implemented only for two clusters (c = 2). Therefore, the architecture is applicable only to two target neurons. By contrast, the proposed circuit is implemented for three clusters (c = 3). Therefore, with throughput 10.3 times (3.38 × 10 7 vs. 3.28 × 10 6 ) higher, the proposed architecture is also able to perform FCM with a larger number of clusters as compared with the work in [20]. All these facts demonstrate the effectiveness of the proposed hardware spike sorting system.

Concluding Remarks
The proposed architecture has been implemented on the Altera FPGA Cyclone IV for physical performance measurement. The architecture is used as an hardware accelerator to the NIOS CPU in an SOC platform. Experimental results reveal that the proposed spike sorting architecture has the advantages of high CCR and high computation speed. For SNR = 10, its CCR is above 96% for three target neurons. When SNR becomes 1 dB, it is still able to retain CCR above 84%. The architecture is able to achieve a 1 GHz clock rate. The speedup over its software counterpart running on the Intel I7 processor is above 97. Its GHA and FCM circuits have a higher computation speed compared with existing hardware implementations for feature extraction and clustering. In particular, the proposed GHA circuit has throughput 16.3 times higher than that of the existing implementations. These results show that the proposed system implemented by FPGA is an effective real-time training device for spike sorting.