Deep Learning-Based Template Matching Spike Classiﬁcation for Extracellular Recordings

: We propose a deep learning-based spike sorting method for extracellular recordings. For analysis of extracellular single unit activity, the process of detecting and classifying action potentials called “spike sorting” has become essential. This is achieved through distinguishing the morphological di ﬀ erences of the spikes from each neuron, which arises from the di ﬀ erences of the surrounding environment and characteristics of the neurons. However, cases of high structural similarity and noise make the task di ﬃ cult. And for manual spike sorting, it requires professional knowledge along with extensive time cost and su ﬀ ers from human bias. We propose a deep learning-based spike sorting method on extracellular recordings from a single electrode that is e ﬃ cient, robust to noise, and accurate. In circumstances where labelled data does not exist, we created pseudo-labels through principal component analysis and K-means clustering to be used for multi-layer perceptron training and built high performing spike classiﬁcation model. When tested, our model outperformed conventional methods by 2.1% on simulation data of various noise levels, by 6.0% on simulation data of various clusters count, and by 1.7% on in-vivo data. As a result, we showed that the deep learning-based classiﬁcation can classify spikes from extracellular recordings, even showing high classiﬁcation accuracy on spikes that are di ﬃ cult even for manual classiﬁcation.


Introduction
Much of neuroscience studies require the analysis of extracellular recordings which is a measurement of neuronal activities. It measures neural signals under various behaviors through microelectrodes inserted into the brain tissue. When obtaining such extracellular recordings, neuronal activities from multiple neurons surrounding each of the microelectrodes are simultaneously recorded. These neurons each show distinct waveform shape of action potential that we call "spikes" depending their morphology, distance from the electrode, intrinsic membrane properties, and the surrounding environment [1][2][3][4]. It is important to match the spikes to the corresponding neurons, i.e., spike sorting.

Dataset
We used two publicly available simulated spike signal datasets and an in-vivo spike signal dataset for our study. First, by using the simulated spike signals where each spike is labeled, we will present the accuracy of the tested models, and through testing on the in-vivo dataset, we will show the applicability of the models in real life applications. Dataset 1 is a simulated dataset provided from University of Leicester that is widely used in spike signal research [12,27,28]. The dataset was generated at sampling rate of 24 kHz. It is composed of four subset data named Easy1, Easy2, Difficult1, and Difficult2. Each subset data contains spikes from three different neuron types under four conditions of noise levels, 0.05, 0.10, 0.15, and 0.20, where the numbers represent the standard deviation of noise. Spikes are detected by setting a threshold value of 4σ n where action potentials values above the threshold are considered as spikes. σ n is obtained as follows: The dataset is publicly available at https://www2.le.ac.uk/departments/engineering/research/ bioengineering/neuroengineering-lab/spike-sorting. Dataset 2 is a simulation data containing various number of neural signal types [29]. This was recorded at a sampling rate of 24 kHz, and threshold value was set to 5σ n . For our experiment, five datasets each for different numbers of clusters (2~5) were selected. Dataset 2 is publicly available at https://www135.lamp.le.ac.uk/hgr3/. The in-vivo dataset was acquired from the cortex of a mouse using a tetrode. Neural activities were amplified (×10,000), filtered (600 Hz-6 kHz), and digitized (30.3 KHz) using Digital Lynx (Neuralynx, Bozeman, MT, USA) [30][31][32]. Single units were manually isolated from trained expert using SpikeSort 3D (Neuralynx) and cluster quality was assessed by L-ratio, isolation distance, cross-correlation, and inter-spike interval (ISI) [33]. Single channel recordings from the tetrode recordings were used for our experiment.

Deep Learning Model
The multiplayer perceptron (MLP) is a widely-used neural network architecture for classification tasks; its low computational cost, structure simplicity, and comparably small dependence on the training data size make it a readily applicable neural network algorithm for classification [34]. In general, MLP is composed of three main components: The input layer, the hidden layers, and the output layer. Each layer is composed of nodes which receives the outputs from the previous layer as their input, referred to as activation values, s i . The activation values are multiplied by the respective weights w ij and summed at consecutive layer nodes, followed by a chosen activation function f. The output of each nodes y j at layer l is described by the following equation: where K l is the number of nodes at the lth layer. While training, w ij are adjusted to minimize the cross-entropy loss function described as the following: where s ans is the output probability of correct classification. For our model, the input layer is a 32-dimensional layer where each dimension corresponds to the 32 sampling points of the data. As for the hidden layers, four cascades of layers each with 256 nodes were used. The number of layers and nodes for each layer were chosen heuristically since there are no known analytical methods for obtaining such parameters [35]. As for the activation function, we have used hyperbolic tangent function for the hidden layers, while for the output layer, which reduces dimension to equal classification categories and gives classification probability of each classification category, Softmax (1) activation function is applied so that the probability sum is equals to 1: where K l is the number of nodes in the output layer. For our experiment, K l is equal to the number of neural signal types. For the training data, α% of the PCA + K-means clustered data nearest to the cluster centers were used under the assumption that data near centers of cluster have lower probability of being misclassified; α was set to 10 for our model. The methodology of PCA and k-means is further explained in Section 2.3.3. Figure 1 outlines the overall structures of the proposed model, and Figure 2 shows the structure of the MLP used in the model. The proposed spike sorting algorithm was written in Python and implement with Keras using a Tensorflow backend. When performed on a dataset with 3526 spikes using an Intel Xeon E5-1620 CPU equipped with a Nvidia Titan X GPU and 96 GB RAM, the model took approximately 9 s for training and 3 s for testing.

Mean Squared Error-Based Template Matching
A basic template matching method for spike classification is mean square error (MSE)-based template matching [18]. By using the cluster center waveform of each PCA + K-means clustered data as the template waveforms, MSE between each test data signal and the template waveforms were obtained. Test signals were then assigned to the corresponding template cluster which resulted in the smallest MSE.

Cross-Correlation
Cross-correlation is a readily utilized method for signal analysis. It allows a measure of association between signals, thus can be used to give a critical altered perspective of signal data and serve as a basis for many advanced signal-processing methods [20,21]. This is especially true for brain signal analysis of EEG, where many reports have been made utilizing cross-correlation values in various tasks such as classification of motor imagery, mental imagery, and epilepsy signals [34,36]. Therefore, we speculate its effectiveness on our extracellular recordings as well and have applied it as a means for comparison.
For signal classification, each signal is classified to the corresponding signal cluster with the template signal which resulted in the highest Pearson correlation coefficient.

Mean Squared Error-Based Template Matching
A basic template matching method for spike classification is mean square error (MSE)-based template matching [18]. By using the cluster center waveform of each PCA + K-means clustered data as the template waveforms, MSE between each test data signal and the template waveforms were obtained. Test signals were then assigned to the corresponding template cluster which resulted in the smallest MSE.

Cross-Correlation
Cross-correlation is a readily utilized method for signal analysis. It allows a measure of association between signals, thus can be used to give a critical altered perspective of signal data and serve as a basis for many advanced signal-processing methods [20,21]. This is especially true for brain signal analysis of EEG, where many reports have been made utilizing cross-correlation values in various tasks such as classification of motor imagery, mental imagery, and epilepsy signals [34,36]. Therefore, we speculate its effectiveness on our extracellular recordings as well and have applied it as a means for comparison.
For signal classification, each signal is classified to the corresponding signal cluster with the template signal which resulted in the highest Pearson correlation coefficient.

Mean Squared Error-Based Template Matching
A basic template matching method for spike classification is mean square error (MSE)-based template matching [18]. By using the cluster center waveform of each PCA + K-means clustered data as the template waveforms, MSE between each test data signal and the template waveforms were obtained. Test signals were then assigned to the corresponding template cluster which resulted in the smallest MSE.

Cross-Correlation
Cross-correlation is a readily utilized method for signal analysis. It allows a measure of association between signals, thus can be used to give a critical altered perspective of signal data and serve as a basis for many advanced signal-processing methods [20,21]. This is especially true for brain signal analysis of EEG, where many reports have been made utilizing cross-correlation values in various tasks such as classification of motor imagery, mental imagery, and epilepsy signals [34,36]. Therefore, we speculate its effectiveness on our extracellular recordings as well and have applied it as a means for comparison.
For signal classification, each signal is classified to the corresponding signal cluster with the template signal which resulted in the highest Pearson correlation coefficient.

PCA K-Means Clustering
There exist immense number of clustering algorithms. Few popular algorithms are fuzzy C-means clustering [37], subtractive clustering [38], Density-based clustering [39], and K-means [15]. For our model, we have chosen K-means as our clustering method as it is one of the most popular clustering algorithms used in literature [40]. K-means clustering is an iterative clustering method which represents each of k clusters C i , where i = {1, . . . , k}, by their mean, i.e., the centroid c i . As such, each data points are classified to the cluster C i which results in the smallest Euclidean distance (1) between the corresponding c i and the point.
K-means, however, is sensitive to the dimensionality of the data where its accuracy drops and susceptibility to outliers rises with increasing dimensions [41]. Therefore, we have implemented a classical dimensionality reduction technique, namely PCA, that is commonly used before applying K-means [14]. PCA reduces dimensionality of data while conserving the correlation structure of the variables and capturing the variability in the data. Given X is the original data matrix and Y = (y 1 , . . . , y n ), The principal components can be obtained through taking the eigenvectors of the covariance matrix where the eigenvalues rank the eigenvectors in the order of variance explained [14]. For our experiment, the data was reduced to 3-dimensions using three principal components with the highest principal directions, and the number of clusters, k, was determined through manual inspection of the data projected on the three principal components.

Results
We tested our model and compared the results with the presented contemporary models on both the single channel simulation datasets and the in-vivo dataset. When recording from a single electrode, signals from multiple neurons are detected. These neurons fire independently, and more than one neural signal types may be recorded at the same time resulting in overlapping signals. Such overlapping signals are considered either as outliers or more challenging data to classify. Table 1 shows the tested classification results as the ratio between the correctly classified spikes and the total number of spikes with and without overlapping spikes on Simulation dataset 1. Our method gives high accuracy overall in both cases where overlapping spikes are included and excluded. There exist few cases where our model scored sub-optimally on data with less than 10% noise. However, in all such cases, our model showed over 98% accuracy, and the difference between the best performing model was miniscule with less than 0.32%. On the contrary, for the Difficult data with high noise level, our model outperformed other methods consistently with noticeable gap. On Difficult 2 data with 20% noise, our model showed 71.15% accuracy on data without overlapping spikes and 51.55% accuracy on data with overlapping spikes. These results when compared to the second-best performing correlation-based template matching are 3.4% and 18.31% higher and compared to the lowest performing MSE-based template matching are 8.43% and 19.55% higher.  Figure 3 shows classification results on simulation dataset 2. Overall, even in cases of varying cluster count and noise levels, our model showed high classification accuracy. In cases where noise did not exist, MSE-based template matching method and our proposed method both showed best performance independent of cluster count. However, with the addition of noise, the accuracy of the compared methods began to drop. When cluster count was 2, all models showed high classification accuracy of above 99.8% with the addition of 10% noise. But with the addition of 20% noise, MSE-based template matching, correlation-based template matching, and PCA + K-means dropped in accuracy by 5.21%, 6.68%, and 1.7%, respectively. The proposed method showed drop of mere 0.29% in accuracy. When cluster count was 3, addition of 10% noise resulted in accuracy drop of 6.36%, 6.54%, and 2.28% for MSE-based template matching, correlation-based template matching, and PCA + K-means. The addition of 20% noise resulted in additional accuracy drop of 13.37%, 16.37%, and 5.87% for MSE-based template matching, correlation-based template matching, and PCA + K-means. The proposed method dropped in accuracy by 1.6% for the addition of 10% noise and 1.61% for the addition of 20% noise. When cluster count was 4, overall performance dropped with the accuracy of 96.70%, 95.73%, 94.45%, and 91.78% for the proposed model, MSE, correlation, and PCA + K-means, respectively. It showed accuracy drop of 2.51%, 3.79%, 7.67%, 2.95% with 10% noise and additional drop of 7.94%, 24.96%, 24.06%, 15.83% with 20% noise. When cluster count was 5, it showed accuracy drop of 8.47%, 14.91%, 15.97%, 20.15% with 10% noise. The addition of 20% noise resulted in additional accuracy drop of 2.89%, 11.6%, 14.45% for the proposed method, MSE-based template matching, and correlation-based template matching method, respectively, but showed increase in accuracy of 3.12% for PCA + K-means.
Our method showed the best performance in in-vivo dataset as well. Figure 4 shows the classification results of spikes from in-vivo in 3-dimensional PCA feature domain. Compared to the manual classification results, our method showed the highest accuracy of 99.39%, and MSE-based template matching, correlation-based template matching, and PCA + K-means resulted in accuracy of 97.73%, 94.90%, and 83.02%, respectively. As seen in Figure 4, all four methods were successful at classifying the red and the blue clusters. However, for the yellow cluster, compared to our method, the other models were highly inaccurate in classifying the spikes on the boundary. classification results of spikes from in-vivo in 3-dimensional PCA feature domain. Compared to the manual classification results, our method showed the highest accuracy of 99.39%, and MSE-based template matching, correlation-based template matching, and PCA + K-means resulted in accuracy of 97.73%, 94.90%, and 83.02%, respectively. As seen in Figure 4, all four methods were successful at classifying the red and the blue clusters. However, for the yellow cluster, compared to our method, the other models were highly inaccurate in classifying the spikes on the boundary.

Discussion
Detection and classification of action potentials from extracellular recording is essential for analysis of single neural signal activity. The template matching method is a classic spike sorting method that can resolve many of the issues that comes along with it. It classifies by measuring the similarities between each spike and the templates, and it shows high performance depending on the quality of extracted templates. We propose an unsupervised deep learning-based classification model that can learn spike waveforms without labels and classify each spike. Our model learned the characteristics of spike clusters by using the representative spikes of the cluster as the training data and showed superior classification performance compared to other template matching methods. Our proposed method showed higher classification accuracy in both simulation data and in-vivo data especially under high levels of noise. Because MSE and cross-correlation sums all the sampling points of a signal individually, it is susceptible to noise. This is apparent in both Simulation dataset 1 and Dataset 2. On Simulation dataset 1, our method showed the highest performance on the most challenging Difficult 2 with noise level of 0.20 data and had the least amount of performance drop with the addition of noise on simulation dataset 2 independent of cluster count. However, using a deep learning model does have higher computational cost compared to other machine learning algorithms. For our model, due to the simple structure of the model, the classification of 3526 spikes required approximately only 12 s.

Discussion
Detection and classification of action potentials from extracellular recording is essential for analysis of single neural signal activity. The template matching method is a classic spike sorting method that can resolve many of the issues that comes along with it. It classifies by measuring the similarities between each spike and the templates, and it shows high performance depending on the quality of extracted templates. We propose an unsupervised deep learning-based classification model that can learn spike waveforms without labels and classify each spike. Our model learned the characteristics of spike clusters by using the representative spikes of the cluster as the training data and showed superior classification performance compared to other template matching methods. Our proposed method showed higher classification accuracy in both simulation data and in-vivo data especially under high levels of noise. Because MSE and cross-correlation sums all the sampling points of a signal individually, it is susceptible to noise. This is apparent in both Simulation dataset 1 and Dataset 2. On Simulation dataset 1, our method showed the highest performance on the most challenging Difficult 2 with Appl. Sci. 2020, 10, 301 8 of 11 noise level of 0.20 data and had the least amount of performance drop with the addition of noise on simulation dataset 2 independent of cluster count. However, using a deep learning model does have higher computational cost compared to other machine learning algorithms. For our model, due to the simple structure of the model, the classification of 3526 spikes required approximately only 12 s.
With increasing number of clusters, probability of similarly shaped waveforms appearing rises and the distance between each cluster in PCA feature domain may decrease, making spike classification more difficult. For Simulation dataset 2 without noise, although all methods showed high accuracy of above 99.8% when the cluster count was two, none were able to score over 97% when the cluster count rose to four. This, however, does not mean that classification accuracy always drops with increasing number of clusters. On Simulation dataset 2 with cluster count of five, all methods showed accuracy of above 98.5%. This shows that data with four clusters contained comparatively more challenging data for classification.
The performance of the deep learning classification model heavily depends on the goodness of the training data. And because PCA + K-means clustered data, used as the pseudo-label for our model, is not a perfect clustering algorithm where error rates increase for data points near the cluster boundaries, increasing the amount of training data also increases the risk of using mislabeled training data. Moreover, the cluster centers chosen by the K-means clustering may fail to correctly find true cluster centers. As shown in Figure 5A, when oval shaped clusters are closely located to each other, the cluster centers may be positioned at the boundary of the clusters. In such cases, using large number of spikes for the training data increases the risk of using misclassified spikes as the training data even more. According to our results shown in Figure 5B, using 10% of the data nearest to the cluster centers as the training data showed the highest performance for such data. For well PCA + K-means clustered data, there were nearly no differences in the performance when more than 10% of the data was used for training. Therefore, our experiment was performed using 10% of data nearest to cluster centers for training.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 8 of 11 number of spikes for the training data increases the risk of using misclassified spikes as the training data even more. According to our results shown in Figure 5B, using 10% of the data nearest to the cluster centers as the training data showed the highest performance for such data. For well PCA + Kmeans clustered data, there were nearly no differences in the performance when more than 10% of the data was used for training. Therefore, our experiment was performed using 10% of data nearest to cluster centers for training. When proceeding with the general K-means clustering, when two clusters are closely located or when cluster shape differs from the spherical shape, it fails to accurately distinguish the cluster boundaries [42]. Therefore, there needs further refinement of the K-means clustering results for good clustering performance. In this research, we have shown that a deep learning-based method can successfully improve the limited spike sorting results from K-means clustering.
As shown in Table 1, although our model compared to other methods shows superior classification performance, it still shows an unsatisfactory accuracy of 71% on Difficult 2 with a noise level of 0.2. This can be explained by the structural similarity of the waveforms and high noise levels as seen in Figure 6A,B. Not only does this make the clustering process more difficult, it also makes template extracting process challenging, which is where our model struggled most on. As shown in Figure 6C, template extraction method through PCA + K-means method fails even with human inspection. However, when given 10% of the data with labels as the training data, our MLP model shows near perfect classification accuracy even on Difficult 2 with noise level of 0.2 with and without overlapping spikes with score of 99.55% and 99.70%, respectively. This shows that if clustering When proceeding with the general K-means clustering, when two clusters are closely located or when cluster shape differs from the spherical shape, it fails to accurately distinguish the cluster boundaries [42]. Therefore, there needs further refinement of the K-means clustering results for good clustering performance. In this research, we have shown that a deep learning-based method can successfully improve the limited spike sorting results from K-means clustering.
As shown in Table 1, although our model compared to other methods shows superior classification performance, it still shows an unsatisfactory accuracy of 71% on Difficult 2 with a noise level of 0.2. This can be explained by the structural similarity of the waveforms and high noise levels as seen in Figure 6A,B. Not only does this make the clustering process more difficult, it also makes template extracting process challenging, which is where our model struggled most on. As shown in Figure 6C, template extraction method through PCA + K-means method fails even with human inspection. However, when given 10% of the data with labels as the training data, our MLP model shows near perfect classification accuracy even on Difficult 2 with noise level of 0.2 with and without overlapping spikes with score of 99.55% and 99.70%, respectively. This shows that if clustering techniques more advanced than PCA + K-means is utilized and the accuracy of training data increase, classification performance of our model can be expected to rise.
boundaries [42]. Therefore, there needs further refinement of the K-means clustering results for good clustering performance. In this research, we have shown that a deep learning-based method can successfully improve the limited spike sorting results from K-means clustering.
As shown in Table 1, although our model compared to other methods shows superior classification performance, it still shows an unsatisfactory accuracy of 71% on Difficult 2 with a noise level of 0.2. This can be explained by the structural similarity of the waveforms and high noise levels as seen in Figure 6A,B. Not only does this make the clustering process more difficult, it also makes template extracting process challenging, which is where our model struggled most on. As shown in Figure 6C, template extraction method through PCA + K-means method fails even with human inspection. However, when given 10% of the data with labels as the training data, our MLP model shows near perfect classification accuracy even on Difficult 2 with noise level of 0.2 with and without overlapping spikes with score of 99.55% and 99.70%, respectively. This shows that if clustering techniques more advanced than PCA + K-means is utilized and the accuracy of training data increase, classification performance of our model can be expected to rise.  For further work, we plan to modify the model to spike sort extracellular recordings using MEAs where the spatial information is utilized to further increase classification performance. Moreover, the model performance will be tested on other dimensionality reduction and clustering algorithms, such as density-based clustering algorithms, to further improve classification accuracy and make the procedure fully automatic.

Conclusions
Accurate spike classification is an essential procedure in signal analysis of extracellular recordings. In this paper, we propose a unsupervised deep learning-based template matching spike sorting method. The results demonstrate that in 1-dimensional spike data analysis, our deep learning-based MLP model successfully improves on PCA + K-means clustering for higher classification performance and robustness to noise compared to other conventional methods.