A Low Cost VLSI Architecture for Spike Sorting Based on Feature Extraction with Peak Search

The goal of this paper is to present a novel VLSI architecture for spike sorting with high classification accuracy, low area costs and low power consumption. A novel feature extraction algorithm with low computational complexities is proposed for the design of the architecture. In the feature extraction algorithm, a spike is separated into two portions based on its peak value. The area of each portion is then used as a feature. The algorithm is simple to implement and less susceptible to noise interference. Based on the algorithm, a novel architecture capable of identifying peak values and computing spike areas concurrently is proposed. To further accelerate the computation, a spike can be divided into a number of segments for the local feature computation. The local features are subsequently merged with the global ones by a simple hardware circuit. The architecture can also be easily operated in conjunction with the circuits for commonly-used spike detection algorithms, such as the Non-linear Energy Operator (NEO). The architecture has been implemented by an Application-Specific Integrated Circuit (ASIC) with 90-nm technology. Comparisons to the existing works show that the proposed architecture is well suited for real-time multi-channel spike detection and feature extraction requiring low hardware area costs, low power consumption and high classification accuracy.


Introduction
There is an increasing demand in data-acquisition systems for neurophysiology to record simultaneously from many channels over long time periods [1]. These experiments accumulate large amounts of data, which would be processed by spike sorting systems for analyzing the activities of neurons. A typical spike sorting system [2,3] usually involves complicated feature extraction and spike classification operations for separating spikes from background noise and clustering the detected spikes. A large amount of spike trains would impose heavy computational load for a software spike sorting system, resulting in a long processing time.
One approach to reduce the computation time is to implement a spike sorting system by hardware. Hardware systems offering dedicated circuits substantially outperform their software counterparts in terms of computational performance. Hardware solutions are beneficial for neurophysiological signal recordings and analysis where real-time computing is crucial. There are two hardware approaches: Field Programmable Gate Array (FPGA) [4] and Application-Specific Integrated Circuit (ASIC) [5]. Some FPGA circuits [6,7] possess high area complexities and/or power consumption, which may be suited only for offline processing. On the contrary, ASIC architectures may have lower area costs and power dissipation. Many ASIC-based spike sorting implementations are then proposed for in vivo applications, where area-and power-efficient design is desired.
Many ASIC architectures based on Principal Component Analysis (PCA) [8][9][10] have been proposed for hardware spike sorting. Although they are effective for feature extraction, the inherent complexities for the computations of the covariance matrix and eigenvalue decomposition in the PCA algorithm may impose high hardware and power costs. The PCA variants, such as the Generalized Hebbian Algorithm (GHA) [11], are able to reduce the hardware costs by lifting the requirements for covariance matrix computation. In the GHA, the principal components are updated incrementally based on a set of training data. Nevertheless, its iterative training procedure may still be a bottleneck for online applications.
Alternatives to PCA and GHA include techniques such as discrete derivatives [12,13], integral transform [14] and zero-crossing [15] for feature extraction. The techniques feature low computational costs without additional training procedures. Nevertheless, some algorithms may be prone to noises due to the simple feature extraction procedure without taking noise into consideration. In addition, some of the algorithms have not been implemented by hardware. The effectiveness of the algorithms for ASIC implementation as compared with other techniques may still need to be evaluated.
The objective of this paper is to present a novel ASIC implementation for spike sorting featuring high classification accuracy, low area costs and low power consumption. The architecture is based on a novel feature extraction algorithm with low computational complexities. In the feature extraction algorithm, the location of the global minimum (or maximum) of a spike is first identified. Based on the location, the spike is then separated into two portions. The area of each portion is then used as a feature. Similar to the algorithms in [12][13][14][15], the proposed algorithm is simple to implement. In addition, it may be less susceptible to noise interference. Observe that the variations in the location of the global minimum (or maximum) of a spike due to noises may be small provided that the other local minimal (or maximal) values are significantly larger (or smaller) than the global one. This may be the case for some spike waveforms. The algorithm may then provide high immunity to noise corruption. Furthermore, the area of each position may be divided by the distance between the global maximum and minimum to enhance the classification accuracy.
A novel VLSI architecture is also presented for the novel feature extraction algorithm. In the architecture, the search for the global minimum and the computation of the area of portions separated by the minimum value are carried out concurrently. This is beneficial for pipelining operation for enhancing the throughput of the feature extraction. To further expedite the process, the local peak search and area computation over smaller segments of the spike can be carried out first. The local peak values and areas are subsequently merged with the global ones by a simple hardware circuit. In addition, due to its simplicity, the architecture can be operated in conjunction with other spike detection circuits. In this work, the circuit based on the Non-linear Energy Operator (NEO) [16] is employed for the multi-channel spike detection. To minimize the power consumption and area costs of the circuits, all of the channels share the same core for spike detection and feature extraction operations. A Clock Gating (CG) technique [17] is also employed to supply the system clock only to the active components of the circuit. A number of ASIC implementations are presented to demonstrate the effectiveness of the proposed architecture. Experimental results reveal that the proposed architecture is an effective alternative for in vivo multi-channel spike sorting with high classification accuracy, low power dissipation and low hardware area costs.
The remaining parts of this paper are organized as follows. The proposed feature extraction algorithm and the corresponding VLSI architecture are presented in Section 2. The architecture of the multi-channel spike sorting system supporting both the NEO and the proposed algorithm is presented in Section 3. Section 4 then evaluates the performance of the spike sorting system. Finally, Section 5 includes some concluding remarks.

The Proposed Algorithm for the Feature Extraction of Spikes
This section presents a novel algorithm for the feature extraction of spikes. Consider a spike x with length N, where x i , i = 1, ..., N, is the i-th sample of x. Let i min and i max be the index of the global minimum and maximum of x, respectively. That is, The index i min (or i max ) is used to separate the sample indices of spike x into two intervals [1, i min ] and [i min + 1, N]. Let a 1 and a 2 be the two features based on i min . They are computed by: The a 1 and a 2 can then be viewed as the area of the spike in the two intervals with x i min as the reference level. Alternatively, x i max can be used as the reference level. In this case, the separation of x is based on i max , and the corresponding features, denoted by b 1 and b 2 , are given by: In addition to the areas, the difference between i min and i max could be a feature beneficial for spike sorting. One way to incorporate the feature is based on the division as shown below.
where f i , i = 1, 2, are the final features produced by the proposed algorithm. An advantage of the proposed algorithm is that the feature vectors extracted by the algorithm may not be susceptible to noise interference. A contributing factor to the noise robustness is that the variations of the locations of peaks (i.e., i min or i max ) may not be high even for large background noises. For example, Figure 1 shows the locations and values of peaks of spikes corrupted by background noises with different SNR levels. Each spike contains 64 samples (i.e., N = 64). The spike data for the figure is obtained from the simulator developed in [18]. It can be observed from Figure 1 that the location variations are small for all of the SNR levels considered in the figure. In particular, when SNR = 1 dB, the location of the minimum value is 41, as shown in Figure 1d. From Figure 1a, we see that the true location of the minimum value is 38. Because the length of spikes is 64, the variation is only 4.68%. This may be beneficial for maintaining low variations of f i in Equation (6) in the presence of noise.
The variations of the locations of peaks may also be small when a spike is overlapped by another one. Examples of spikes with different degrees of overlapping are revealed in For the sake of simplicity, we let J 1 = [1, i min ] and J 2 = [i min + 1, N]. In addition to the locations and values of peaks, the J 1 and J 2 are marked for each spike in the figures. This is beneficial for the observance of the impact on the extraction of feature vectors due to spike overlapping. We first consider the example without spike overlapping, as shown in Figure 2. There are two spikes in the example. The first and the second spike, denoted by Spike 1 and Spike 2, are located in the first 64 samples and final 64 samples in Figure 2a,b, respectively. Therefore, after the combination, the waveform of each individual waveform remains unaltered, as revealed in Figure 2c. That is, the peak locations stay the same, and the J 2 of Spike 1 is not overlapped with the J 1 of Spike 2. The feature vectors extracted from these two spikes will not be changed after the combination.
In the next example, Spike 2 is shifted leftward by 16 samples. Because the length of each spike is 64 samples, the Spike 1 and Spike 2 are overlapping by 25%, as shown in Figure 3. In this case, the combination of these two spikes introduces a slight distortion to each individual spike, as revealed in Figure 3c. Nevertheless, it can be observed that the variations in peak locations are small for each spike. Therefore, J 1 and J 2 can still be identified accurately for each spike. We can also see from Figure 3c that the J 2 of Spike 1 is overlapped with the J 1 of Spike 2. That is, one of the two areas separated by the i min of Spike 1 also belongs to Spike 2, and vice versa. The a 2 of Spike 1 and a 1 of Spike 2 may then not be accurately extracted. However, because the remaining intervals (i.e., J 1 of Spike 1 and J 2 of Spike 2) are still non-overlapping intervals, the a 1 of Spike 1 and a 2 of Spike 2 can be computed accurately. The feature vectors may still be useful for subsequent spike classification.    Figure 4 shows the third example, where Spike 1 and Spike 2 are overlapping by 50%. We can see from Figure 4 that the peak location variations are still small even for this case. This is helpful for the successful identification of J 1 and J 2 for each spike. Due to the large area overlapping, the distortion of spike waveforms may be visible. This is particularly true for Spike 1 by comparing the waveform of the first 64 samples in Figure 4a with that of Figure 4c. This can also be further confirmed by observing from Figure 4c that both the J 1 and J 2 of Spike 1 are overlapped with both J 1 of Spike 2. The correct classification of Spike 1 may then be difficult. Nevertheless, the successful classification of Spike 2 may still be possible because its J 2 is not overlapped with the intervals of Spike 1, and accurate extraction of a 2 and subsequently classification for Spike 2 may still be successful.
Finally, when Spike 1 and Spike 2 are overlapping by more than 50%, large distortion of spike waveforms may be observed. The deviations of peak locations from their original ones before spike combinations may then be large, resulting in inaccurate identification of J 1 and J 2 . In addition, larger overlapping of J 1 and J 2 of both spikes is possible. Consequently, accurate extraction of a 1 and a 2 of Spikes 1 and 2 may become more difficult in this case.
In summary, based on the results provided by the above examples, we see that the proposed algorithm may be robust to noise interference. The locations of peaks may not be susceptible to background noises and spike overlapping. The variations may be small even when the SNR = 1 dB and/or the degree of spike overlapping is up to 50%. The proposed algorithm is based on the locations of peaks for the computation of feature vectors. The robustness of the peak locations observed from the examples is then beneficial for providing useful insight into the effectiveness of the proposed algorithm.

Operations of the Proposed Architecture Based on the Algorithm
From Equation (6), we see that the computation of f i is dependent of the selection of x i min or x i max as the reference. For the sake of simplicity, only the case where x i min is the reference is considered for the hardware implementation. The hardware design for the other case can be followed in a similar fashion.
One direct approach for the hardware implementation of the proposed algorithm is to first identify i min and i max from the spike x. The computation of a i and f i is then followed. Although the approach is simple, it is necessary to store the spike x. This is because the spike x is first used for finding i min and i max , and then, it is re-used again for computing a i , i = 1, 2. As a consequence, the latency of the circuit may be long. In addition, the circuit may need a buffer to store x. When the dimension N of x is large, the area costs may be high.
The proposed architecture is able to alleviate the drawbacks stated above. It concurrently computes i min , i max and a i , f i , i = 1, 2. To carry out the concurrent operation, a novel incremental approach is proposed. Before presenting the approach, we first note that a i , i = 1, 2, in Equations (2) and (3) can be rewritten as: Both a 1 and a 2 can be computed after all of the N samples of x are available. In the case that only the first j samples of x are available, we define: as the current i min and i max up to the j-th sample of x, respectively. Based on p(j), the incremental versions of a 1 and a 2 , denoted by a 1 (j) and a 2 (j), are defined as: where: Clearly, when j = N, it follows from Equations (1) and (9) that p(N) = i min . Therefore, a 1 (N) = a 1 and a 2 (N) = a 2 . Based on a i (j), i = 1, 2, we then define f i (j), i = 1, 2, as: The f i (j) can also be viewed as the incremental version of f i , i = 1, 2. From Equations (6) and (13), it can also be easily shown that f i (N) = f i when x i min is used as the reference.
It is interesting to note that the computation of S i (j), i = 1, 2, can be carried out recursively from their predecessors S i (j − 1), i = 1, 2. To explore the recursive relationship, two cases are considered separately. The first case is p(j) = p(j − 1). This implies that the current i min remains the same after the new sample x j has arrived. In this case, In the second case, p(j) = p(j − 1). This occurs only when x j = min 1≤i≤j x i . Therefore, in this case, the current i min is updated as p(j) = j. It then follows that: All of the operations are based on the initial conditions S 1 (−1) = S 2 (−1) = 0. We can observe from Equations (14)-(17) that S 1 (j) and S 2 (j) can always be obtained from S i (j − 1), i = 1, 2, regardless of the variations of p(j). In addition to providing incremental computation, another important advantage is that it is not necessary to reuse x j for the computation of S i (k), i = 1, 2, for any k > j. This is beneficial for hardware design because it is not necessary to adopt a buffer for storing x for data re-use. Figure 5 shows the proposed architecture for feature extraction. As shown in Figure 5, it can be separated into three parts: the Global Minimum and maximum Search (GMS) unit, the ACCumulation (ACC) unit and the Feature Search (FS) unit. Based on an input sample x j , the GMS unit computes p(j) and q(j), the current i min and i max , respectively. The ACC unit then calculates S i (j), i = 1, 2. Based on the results of GMS and ACC units, the FS unit produces the features f i (j), i = 1, 2.

GMS Unit, ACC Unit and FS Unit
The architecture of the GMS unit is depicted in Figure 6. The architecture updates and stores the current i min and i max , as well as the current x i min . It contained two modules: the GMS 1 module and the GMS 2 module. The goal of the GMS 1 module and GMS 2 module is to find the p(j) and q(j), respectively. Given an input sample x j , in the GMS 1 module, the comparison between x j and x p(j−1) is carried out first. When x j > x p(j−1) , it follows from Equation (9) that p(j) = p(j − 1). In this case, no updating occurs. Otherwise, we update p(j) = j, and x p(j) = x j . The module will also notify the occurrence of updating to the ACC unit. The GMS 2 module operates in a similar fashion for the updating of q(j), which occurs when x j > x q(j−1) .
There are two components in the ACC unit, termed the ACC 1 module and the ACC 2 module, respectively. Their goal is to compute S 1 (j) and S 2 (j). The operations of the ACC 1 and ACC 2 modules are based on Equations (14)- (17), respectively. As shown in Figure 7, each module contains a register, an adder and a multiplexer. As the input x j enters the proposed architecture, the current values held by the register of the ACC 1 and ACC 2 modules are S 1 (j − 1) and S 2 (j − 1), respectively. The input sample x j directly enters the ACC 2 module.  In the case that no updating operations are required (i.e., p(j) = p(j − 1)), ACC 2 will add x j and S 2 (j − 1) to compute S 2 (j) in accordance with Equation (15). In addition, from Equation (14), we see that S 1 (j − 1) and S 1 (j) will be the same for ACC 1 module when no updating occurs. When p(j) = p(j − 1), ACC 2 will first compute S 2 (j − 1) + x j and then push this value to ACC 1. Meanwhile, based on Equation (17), the value of ACC 2 would then be reset to zero. Upon receiving the value S 2 (j − 1) + x j from ACC 2, ACC 1 adds the value to S 1 (j − 1) according to Equation (16). Figure 8 shows the architecture of the FS unit, which contains two components. The goal of the first component, termed the FS 1 module, is to compute a 1 (j) and a 2 (j) based on the results produced by the GMS unit and the ACC unit. The computation is carried out in accordance with Equations (10) and (11). Using the results produced by the FS 1 module, the second component of the FS unit, termed the FS 2 module, then computes the features f 1 (j) and f 2 (j) by Equation (13). As shown in Figure 8, both the FS 1 module and the FS 2 module mainly contain arithmetic operators for the hardware implementation of Equations (10), (11) and (13).

Extension of the Proposed Architecture for Parallel Computation
The proposed architecture in its basic form operates on spikes one sample at a time. When the number of samples N in a spike is large, the latency of the computation may be long. One way to solve the problem is to separate a spike into non-overlapping segments and then operate on the segments concurrently. The computation time can then be reduced.
Let K be the number of segments for the parallel computation, and K is a power of two. For the sake of simplicity, we first consider K = 2. The computation for other cases of K > 2 can be easily extended from this simple case. Figure 9 shows the proposed architecture for the concurrent operations with K = 2. Because K = 2, a spike will be separated into two segments. The samples x j with 1 ≤ j ≤ (N/2) belong to the first segment and the others the second segment. There are two inputs: x j and x j+N/2 , which are the j-th sample of the first and the second segments, respectively. There are two GMS units and two ACC units. As shown in the figure, the units denoted by GMS Unit A and ACC Unit A are for the first segment. The others are for the second segment. Their computation results are then combined by the circuits termed the GMS merger unit and the ACC merger unit. Finally, the FS unit computes the final features based on the data provided by the GMS merger unit and the ACC merger unit.
To discuss the operations of the circuits, we first define sets A(j), B(j), and C(j) as: In addition, Based on Equation (19), we further define: The goal of GMS Units A and B is to compute p A (j), q A (j) and p B (j), q B (j), respectively. Their architecture is identical to that shown in Figure 6. The ACC Units A and B find S A,1 (j), S A,2 (j) and S B,1 (j), S B,2 (j), respectively. These units can be operated by the architecture in Figure 7. Based on p A (j), q A (j) and p B (j), q B (j), the GMS merger unit produces p C (j), q C (j) (as well as x p C (j) ). The ACC merger unit then computes S C,1 (j), S C,2 (j). Figures 10 and 11 shows the architectures of the GMS merger unit and the ACC merger unit, respectively.
It can be easily observed from Equations (19) and (20) that: Therefore, as shown in Figure 10, the GMS merger unit carries out the comparison operations over x p A (j) and x p B (j) (or x q A (j) and x q B (j) ) for finding p C (j) (or q C (j)). Only comparators and multiplexers are required for the operations. We can also derive from Equations (21)-(24) that when p C (j) = p A (j), On the other hand, when p C (j) = p B (j),  The operations of the ACC merger unit then follow Equations (26)-(29). These operations can be carried out by only adders and multiplexers, as shown in Figure 11.
Based on the data produced by the GMS merger unit and ACC merger unit, the FC unit then computes f C,1 (j) and f C,2 (j), defined as: where: The architecture of the FS unit for parallel computation is also identical to that of the basic circuit shown in Figure 8. It can be observed that, when j = N/2, C(N/2) = {i : 1 ≤ i ≤ N}. Therefore, a C,i (N/2) = a i , i = 1, 2. As a result, f C,i (N/2) = f i , i = 1, 2.
The proposed architectures can also be viewed as trees. The basic architecture shown in Figure 5 is a unitary tree with a leaf node and a root node. The parallel computation circuit in Figure 9 for K = 2 can be viewed as a simple binary tree with two leaf nodes, one intermediate node and one root node. Each leaf node contains a GMS unit and an ACC unit. Each intermediate node consists of a GMS merger unit and an ACC merger unit. The root node comprises the FS unit.
The parallel computation for a larger number of K, where K is a power of two, is a simple extension of K = 2. In this case, the circuit can be viewed as a binary tree with K leaf nodes and K − 1 intermediate nodes and a root node. There are 2 + log 2 K layers for the parallel computation. The leaf nodes form the first layer. The intermediate nodes can be organized into the subsequent log 2 K layers. The root node is the final layer of the circuit. Figure 12 shows the examples of the binary trees for K = 2 and K = 4. The circuit with K a power of two operates by first separating a spike into K non-overlapping segments with length N/K. Each segment is assigned to a dedicated leaf node. Samples of each segment are delivered to the assigned leaf node one sample at a time. The data produced by the leaf nodes are then forwarded to the first layer of the intermediate nodes. Each layer would then receive data from its preceding layer and deliver the results to the next layer. The same process is repeated until the final layer is reached.

Proposed Architecture for Multi-Channel Feature Extraction
The proposed architecture can be easily operated in conjunction with the multi-channel spike detection circuit for multi-channel feature extraction. Figure 13 shows the architecture of a multi-channel spike sorting system based on the proposed feature extraction circuit. As depicted in the figure, there are three components: spike detection circuit, spike buffer and the proposed feature extraction circuit. Figure 13. The architecture of a multi-channel spike sorting system based on the proposed feature extraction circuit. The spike detection circuit is able to carry out the detection for multiple channels by the NEO algorithm. Let M be the number of channels for spike sorting. Assume all of the channels are sampled with the same sampling rate r s . Let T s = 1/r s be the sampling period. All of the channels are assumed to be sampled and multiplexed by a mixed mode circuit using the round robin approach. The mixed-mode circuit then delivers the samples one at a time to the spike detection circuit. Therefore, the circuit receives M samples during a time interval of length T s . Different samples received during the interval are from different channels.
The spike detection circuit can be separated into two portions: the NEO buffer and the NEO detection unit. The NEO buffer is a (M × N)-stage Serial-In-Serial-Out (SISO) shift register organized in a snake-like fashion. Each stage contains a sample of a spike train from a channel. It is therefore able to hold N consecutive samples of the spike trains from the M channels.
The bottom row of the buffer provides N consecutive samples of the spike train from a channel (say, channel h). It can be seen from Figure 14 that the bottom row of the NEO buffer is used for both NEO detection and peak alignment. The NEO detection unit takes three consecutive samples of the bottom row to carry out the NEO computation. The computation result is then compared to a given threshold γ. If the result is larger than the threshold, a hit event is issued. In addition, the entire last row is regarded as a detected spike and is copied to the spike buffer for spike alignment. The spike detection circuit is able to perform spike detection one channel at a time. After the spike detection of the channel h is completed in the current clock cycle, all of the spike samples already in the NEO buffer are shifted to the next stage, and a new sample from the next channel (selected in a round robin fashion) enters the first stage of the NEO buffer. This allows the spike detection for the next channel to be carried out in the next clock cycle.
The goal of the spike buffer is to hold the detected spikes produced by the spike detection circuit, carry out the alignment and deliver the detected spikes to the proposed feature extraction circuit. As depicted in Figure 15, there are two stages in the spike buffer: the input stage and the output stage. The input stage is an N-input N-output buffer holding the detected spikes provided by the spike detection circuit. The output stage is an N-input K-output buffer fetching data N samples at a time from the input buffer. The buffer then delivers data K samples at a time to the proposed feature extraction circuit. The input stage contains M entries, where each entry i stores the detected spike for channel i. Given N, K, the maximum number of channels M can be evaluated. For any channel h in the circuit, a detected spike in that channel could be discarded when the spike is over-written in the spike buffer by the next detected spike from the same channel h before it can be further processed.
Recall from Section 2.2.3 that the proposed architecture supporting parallel computation is able to accept K samples at a time. Therefore, the proposed architecture is able to process a detected spike with N samples in N/K clock cycles. To find the maximum number of channels M, the worst case scenario is considered. In the scenario, the detected spikes from all of the M channels are stored in the spike buffer and are not processed by the feature extraction circuit yet. In addition, the feature extraction circuit is currently busy. Assume that the newest detected spike is from channel h. In this case, the spike buffer is able to accept another detected spike from channel h without discarding the old one only after the feature extraction circuit has processed M spikes. Because the latency for processing each detected spike is N/K clock cycles, it follows that the input buffer is able to accept another detected spike from channel h after MN/K clock cycles. Let Q be the minimum number of samples between the peak of successive spikes detected by the NEO circuit from the same channel. Assume that Q is the same for all of the channels. It then follows that a detected spike from channel h is not overwritten and discarded when: where T c = 1/r c is the clock period and r c is the clock rate. It is interesting to know that the NEO circuit imposes an additional limit on the number of channels M. It is desired that the NEO circuit be able to receive one sample from each channel in a single sampling period T s . Based on the round robin scheme for fetching samples for different channels, it is therefore necessary that: Combining Equations (33) and (34), we see that the maximum number of channels, denoted by M max , should then satisfy:

Experimental Results
This section presents the performance of the proposed architecture. The area complexities are first considered. Five types of area costs are evaluated: the number of comparators, adders/subtractors, multipliers/dividers, registers and multiplexers. All of the costs are expressed in terms of the asymptotic function (i.e., the big O function). Table 1 shows the area complexities of the proposed feature extraction circuit. It can be observed from Table 1 that all of the area costs of the GMS units, ACC units, GMS merger units, ACC merger units and FS units are independent of the spike length N and the number of channels M. However, for the parallel computation, the number of comparators, adders/subtractors, registers and multiplexers of all of the leaf nodes will grow with the number of segments K. This is because the total number of leaf nodes is dependent on K, and each leaf node contains a GMS unit and an ACC unit. Table 1. The area complexities of the proposed feature extraction circuit.

GMS Unit
Similarly, because the total number of intermediate nodes is dependent on K and each intermediate node contains a GMS merger unit and an ACC merger unit, the area complexities of comparators, adders/subtractors and multiplexers of all of the intermediate nodes are O(K). On the contrary, there is only one root node for parallel computation, and the root node consists of only the FS unit. Among the units in the proposed architecture, the FS unit is the only unit containing multipliers/dividers. Therefore, the number of multipliers/dividers is fixed, and is independent of N, M and K. Summarizing the discussions stated above, we conclude that the complexities of the total number of comparators, adders/subtractors, registers and multiplexers of the proposed feature extraction circuit with K segments are O(K). Moreover, the total number of multipliers/dividers is only O(1).
Based on Table 1, the overall area complexities of the proposed spike sorting circuit are summarized in Table 2. To facilitate the evaluation, the area complexities of the NEO circuit and spike buffer are also included in the table. For the NEO circuit, it is necessary to store spike trains from all of the M channels for detection. In the spike buffer, each channel needs to have its own memory unit to hold the detected spikes for the subsequent feature extraction. Therefore, it can be observed from Table 2 that the number of registers in both the NEO circuit and spike buffer are dependent on the spike length N and the number of channels M. The total number of registers of the proposed spike sorting circuit is then O(K + MN). In addition, because the NEO circuit has only a fixed number of adders and multipliers independent of N and M, the overall area complexities for the adders and multipliers in the spike sorting circuit are only O(K) and O(1), respectively.

NEO Circuit
We next evaluate the actual hardware resource consumption of the proposed circuit. For the remaining evaluations of this section, the dimension of spikes is N = 64. The circuit implementation is carried out by ASIC with the Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm technology and clock gating. The gate level design platform is the Synopsys Design Compiler. Table 3 shows the area (µm 2 ) of the proposed circuit for different numbers of channels M and numbers of segments K. From Table 3, we see that the area of the proposed circuit grows with M and K, which is consistent with the results shown in Table 2. Table 4 shows the normalized area (µm 2 per channel) of the proposed architecture. The normalized area is defined as the area of the circuit divided by the number of channels M. The normalized area can be viewed as the average area cost per channel. Note that all of the channels share the same computation cores in the NEO circuit (i.e., the NEO detection unit) and the proposed feature extraction circuit. Therefore, we can see from Table 4 that the average area per channel decreases as M increases. Consequently, because of the hardware resource sharing scheme, the proposed architecture shows a higher efficiency in area costs for a larger number of channels. Although the normalized area grows with K for a fixed M, the circuits with larger K have the advantages of lower latency for feature extraction. Therefore, as shown in Equation (35), the channel capacity M max grows with K.  In addition to the area costs, the power consumption of the proposed architecture is also evaluated. Table 5 shows the normalized power dissipation (µW per channel) for different numbers of channels M with and without clock gating. The normalized power dissipation is defined as the total power consumption of the circuit divided by the number of channels M. In the experiment, the number of segment K = 1, and the clock rate r c is 1 MHz. The power consumption is measured numerically by Synopsis Prime Time. When the number of channels M increases, because the normalized area decreases, we can see from Table 5 that the normalized power consumption also lowers for the circuit without clock gating. In addition, the clock gating is able to further reduce the power consumption by not supplying the clock signal to the inactive components. However, when M increases, the components of the proposed feature extraction circuit may become more busy because all of the M channels share the circuit. As a result, when M increases from 32-64, it can be observed from Table 5 that the power consumption with clock gating is not lowered. Nevertheless, the the circuit with clock gating is still able to achieve a 33% power reduction as compared with its counterpart without clock gating.
We also compare the proposed architecture with other ASIC implementations [8][9][10][11] for spike sorting in Table 6. It may be difficult to directly compare these architectures because they may be implemented by different technologies and may be based on different spike detection and/or feature extraction algorithms. Moreover, their operation clock rates may be different. Nevertheless, it can be observed from Table 6 that, as compared with the existing architectures, the proposed architecture provides effective area-power performance while supporting both spike detection and feature extraction functions.
Note that, among these existing architectures, the proposed architecture and the one in [11] are based on the same clock rate, technology and spike detection scheme. We can see from Table 6 that the normalized area and normalized power of the proposed architecture are only 25.40% (i.e., 21,127 vs. 83,159) and 25.44% (i.e., 20.03 vs. 78.719) of those of the architecture in [11], respectively. This is because the proposed feature extraction algorithm has significantly lower computational complexities than those of the GHA algorithm adopted by [11].  Although the proposed feature extraction algorithm has simple computation, it is effective for spike classification. Tables 7-10 show the Classification Success Rate (CSR) of the Fuzzy C-Means (FCM) [19] algorithm by clustering the feature vectors produced by various feature extraction algorithms. The CSR is defined as the number of spikes that are correctly classified divided by the total number of spikes. The PCA, GHA and zero-crossing [15] algorithms considered in the tables are implemented by MATLAB with double precision floating numbers. The proposed algorithm is implemented by hardware with 10-bit finite precision.
To see the robustness of the proposed algorithm, different types of noise interference are included in the experiments: background noises, interfering neurons and overlapping spikes. They are considered separately to facilitate our observation. The spike trains for the experiments in Tables 7-9 are obtained from the simulator developed in [18]. It also gives access to the ground truth about the spiking activity in the recording for quantitative assessment. Table 7 shows the CSRs of various feature extraction algorithms for background noises with different SNR levels. In the experiments, the number of interfering neurons is set to be zero. In addition, 20% of the spikes are overlapping. It can be observed from Table 7 that the CSRs of the proposed algorithm are only slightly degraded as the SNR values decrease. In addition, the proposed algorithm has CSR performance comparable to that of PCA and GHA for all of the SNR levels. Moreover, it outperforms the zero-crossing algorithm. When the SNR = 1 dB, the proposed algorithm is still able to achieve 96.47% CSR. As shown in Figure 1, the proposed algorithm is robust to background noise because the variations of peak locations due to background noise corruption are small. This leads to successful identification of intervals J 1 and J 2 for the computation of features a 1 and a 2 . The CSRs of various algorithms for different numbers of interfering neurons are revealed in Table 8. The interfering neurons are the nearby neurons whose spike times are correlated with the original spike times of the target neurons. We set SNR = 4 dB for the background noises. The percentage of the overlapping spikes for the experiments is 20%. From Table 8, we can see that only a small degradation is observed for the proposed algorithm as the number of interfering neurons grows. Its performance is also comparable to that of the PCA and GHA algorithms. The influences of the overlapping spikes on CSRs are considered in Table 9. In the experiments, the SNR level of background noises is 8 dB. In addition, there are no interfering neurons, so that the the impact of overlapping spikes can be easily observed. It appears from Table 9 that the proposed algorithm is able to maintain CSRs above 95% even when the percentage of the overlapping spikes is 30%. Because the proposed algorithm may still be able to find the J 1 or J 2 correctly for the overlapping spikes as revealed in Figures 3 and 4, successful feature extraction and classification may still be possible. As a result, the proposed algorithm may be able to maintain high CSRs in the presence of overlapping spikes. Table 9. The CSRs of the various feature extraction algorithms for different percentages of overlapping spikes. The spike source data are obtained by the simulator in [18].

Algorithms
Percentage In addition to the simulator developed in [18], the spike trains in the wave_clus database are also considered in the experiments. Table 10 shows the resulting comparisons. Similar to the results shown in Tables 7-9, we can also see from Table 10 that the CSR performances of the PCA, GHA and the proposed algorithm are comparable. In addition, the proposed algorithm outperforms the zero-crossing algorithm. Although PCA and GHA offer higher classification accuracy, the algorithms have higher computational complexities. We can see from Table 6 that their hardware implementations may have large area costs and power consumption. The computational complexities of zero-crossing are low. However, the algorithm has inferior CSR values. In particular, for the spike train data "C_Difficult2_noise005" in Table 10, the CSR of the zero-crossing is only 69.92%. By contrast, the proposed algorithm still maintain high accuracy (i.e., 82.13%) for the data. Therefore, the proposed algorithms have the advantages of low computational complexities for efficient hardware implementation and are capable of providing feature vectors for spike classification with high accuracy.

Conclusions
We have implemented the proposed architecture for spike sorting by ASIC with 90-nm technology. Experimental results reveal that the proposed architecture has the advantages of low area costs, low power consumption and high classification accuracy. For the 32-channel design example provided in the paper, the normalized area is 21,127 µm 2 /channel, which is the lowest as compared with the existing designs considered in the paper. When operating at 1 MHz, the proposed architecture consumes normalized power of 20.03 µW/channel. The CSR values of the FCM based on the feature vectors provided by the proposed algorithm are also comparable to those of the PCA and GHA techniques. The proposed architecture therefore is an effective alternative to the applications where implantable spike sorting circuits with low power consumption, low area costs and high CSR are desired.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: