An Accurate and Robust Method for Spike Sorting Based on Convolutional Neural Networks

In the fields of neuroscience and biomedical signal processing, spike sorting is a crucial step to extract the information of single neurons from extracellular recordings. In this paper, we propose a novel deep learning approach based on one-dimensional convolutional neural networks (1D-CNNs) to implement accurate and robust spike sorting. Results on simulated data demonstrated that the clustering accuracy in most datasets was greater than 99%, despite multiple noise levels and various degrees of spike overlap. Moreover, the proposed method performed significantly better than the state-of-the-art method named "WMsorting" and a deep-learning-based multilayer perceptron (MLP) model. In addition, experimental data recorded from the primary visual cortex of a macaque monkey were used to evaluate the proposed method in a practical application. It was shown that the method could successfully isolate most spikes of different neurons (ranging from two to five) by training the 1D-CNN model with a small number of manually labeled spikes. Taken together, these results show that the proposed deep learning method offers high accuracy and strong robustness for spike sorting. It lays the foundation for application to more challenging tasks, such as distinguishing overlapped spikes and the simultaneous sorting of multichannel recordings.


Introduction
Analyzing the electrophysiological activities of neurons is a basis for exploring brain functions. Neurons in the brain communicate with each other by propagating action potentials, also referred to as "spikes" [1][2][3]. Hence, unraveling the interactions between different neurons is a key step to advancing our understanding of brain science and neural engineering [4]. Prior to interpreting the information carried by the interactions, an essential operation is to separate the spikes of individual neurons picked up by an extracellular electrode [5]. That is, it is necessary to determine which spike corresponds to which neuron, because each electrode records the extracellular field in its vicinity and can detect the spikes emitted by several neurons adjacent to the electrode site [6][7][8][9]. This identification is universally termed spike sorting [10][11][12], which is conducted on the basis of the similarity of spike waveforms. The reason lies in the fact that each neuron fires spikes of a particular shape, depending on the morphology of its dendritic tree, its distance and orientation relative to the recording site, etc. [11,13].
In fact, there are usually four main steps in conventional approaches for spike sorting: (1) Bandpass-filtering (e.g., 300-3000 Hz) the recorded raw data to eliminate the interferences of

Architecture of the 1D-CNN Model
The standard architecture involved in a CNN includes the convolutional layer, rectified linear activation function, pooling layer, and fully connected layer [36]. Figure 1 illustrates the proposed CNN architecture working on a simulated database. The model contained four convolutional layers and two pooling layers, followed by a fully connected neural network and a softmax classifier. The CNN model used for the experimental data had a similar architecture; the only difference was the length of the input layer, which was 64 for the simulated data and 12 for the experimental data. The details of the model are described below.


Convolutional layer:
The main building block of a CNN is the convolutional layer, which performs most of the computational heavy lifting and extracts features from the input signals. In this paper, four convolutional layers were arranged into nine feature maps [36]. This choice was made because the accuracy nearly stopped growing once the number of convolutional layers exceeded four. For all convolutional layers, the stride was 1, the kernel size was 3, and "same" padding was used. A small stride extracts more comprehensive features and misses less information; accordingly, the stride was set to 1. Compared with a larger convolution kernel, smaller kernels capture more feature information at a lower computational cost; thus, the kernel size of the convolutional layer was set to 3. We padded the edges of the input matrix with zeros, which allowed the filters to cover the edges of the input and gave us control over the size of the feature maps.
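The effect of "same" zero padding with stride 1 and kernel size 3 can be illustrated with a minimal NumPy sketch (our own illustration, not the paper's code): the output keeps the input length.

```python
import numpy as np

def conv1d_same(x, kernel, stride=1):
    """CNN-style 1D convolution (cross-correlation, no kernel flip)
    with zero ('same') padding on both edges."""
    k = len(kernel)
    xp = np.pad(x, k // 2)  # zero padding so the filter covers the edges
    return np.array([np.dot(xp[i:i + k], kernel)
                     for i in range(0, len(x), stride)])

x = np.arange(8, dtype=float)                   # a toy 8-sample input
y = conv1d_same(x, np.array([1.0, 0.0, -1.0]))  # kernel size 3, stride 1
# With 'same' padding and stride 1, y has the same length as x.
```

With stride 1, each output sample sees a full 3-sample neighborhood, which is why small kernels miss little information while staying computationally cheap.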


Rectified linear activation function and regularization techniques:
The rectified linear unit (ReLU) layer introduces nonlinearity into the data and eases optimization [36]. We chose ReLU because it alleviates the vanishing gradient problem. It was used as the activation function for layers 1, 2, 4, 6, 7, and 8. The softmax function was employed in layer 9 (the last layer of the network). Using the softmax classifier, the cluster to which the input data belong can be predicted. Batch normalization, which keeps the inputs closer to a normal distribution, was applied to the output of the convolutional layers. Dropout [46], another regularization technique, was applied before the fully connected layers to reduce overfitting by randomly setting the values of some input neurons to zero.
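As a hedged illustration (our own code, not the paper's), ReLU and softmax can be written in a few lines of NumPy; softmax turns the n output scores into cluster probabilities.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative inputs are clipped to zero."""
    return np.maximum(x, 0.0)

def softmax(z):
    """Softmax over the n output scores; subtracting the max is a
    standard numerical-stability trick."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # one score per spike cluster
p = softmax(logits)                 # predicted cluster probabilities
```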
Pooling layer: The max pooling layer is implemented to reduce the dimensionality of the feature maps and, thus, the overall computational complexity. Max pooling compares the values in each window and outputs the largest one. In this paper, two max pooling layers were used with a kernel size of 3. The stride was set to 1, and the filter convolved around different layers of the feature map by sliding one unit each time.
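The pooling operation described above (kernel size 3, stride 1) can be sketched as follows; this is our own minimal illustration.

```python
import numpy as np

def max_pool1d(x, size=3, stride=1):
    """Max pooling: slide a window over x and keep the largest value
    in each window (size 3, stride 1, as described above)."""
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, stride)])

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
pooled = max_pool1d(x)  # each output is the max of a 3-sample window
```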
Fully connected layer: The output dimensions of the final fully connected layer depend on the number of classes. For an n-class (n kinds of spikes) problem, the output dimensions were set to n. There were three kinds of spikes in the simulated data; that is, the number of output dimensions was three. For the experimental data, we separately set up the output dimensions for each channel, depending on the number of clusters determined by the spike sorting algorithm and human intervention.
As shown in Figure 1, the input layer (layer 0) with a resolution of 64 × 1 was convolved with 32 filters of kernel size 3 to form the first layer (layer 1). The second convolutional layer with a kernel size of 3 (64 filters) was applied to produce the second layer. A max pooling layer was employed with a kernel size of 2 (layer 3). A convolutional operation was applied to layer 3 (32 × 64) to form layer 4. The feature maps from layer 4 were once again pooled with a kernel size of 2 to form the last max pooling layer (layer 5). Then, another round of convolution was employed with a kernel size of 3. The next three layers were the fully connected layers. The neurons of feature maps in layer 6 were fully connected to 300 neurons in layer 7. Layer 7 and layer 8 were respectively fully connected to 100 and n outputs in layers 8 and 9.
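The feature-map lengths through this stack can be checked with simple bookkeeping. This sketch assumes "same" convolutions with stride 1 and max pooling with kernel size 2 and stride 2 (the halving stride is our assumption, but it matches the 64 → 32 → 16 sizes above).

```python
def conv_same_len(n, stride=1):
    """'Same' padding keeps the length when the stride is 1."""
    return (n + stride - 1) // stride

def pool_len(n, size=2, stride=2):
    """Output length of a max pooling layer."""
    return (n - size) // stride + 1

n = 64                  # layer 0: input, 64 x 1
n = conv_same_len(n)    # layer 1: conv, 32 filters -> 64
n = conv_same_len(n)    # layer 2: conv, 64 filters -> 64
n = pool_len(n)         # layer 3: max pool         -> 32
n = conv_same_len(n)    # layer 4: conv             -> 32
n = pool_len(n)         # layer 5: max pool         -> 16
n = conv_same_len(n)    # layer 6: conv             -> 16
```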

Simulated Database and Experimental Datasets
The simulated database created in "Wave_Clus" [14] has been widely used in the evaluation of several spike sorting algorithms [42,44,47]. In total, 594 different average spike waveforms compiled from real recordings were adopted as templates to construct simulated signals. Randomly selected spikes were superimposed at random times and amplitudes to mimic the background noise. Then, it was feasible to realize different signal-to-noise ratios (SNRs) by altering the ratio between the amplitudes of the noise and signals. More details related to the generation of the simulated database can be found in [43].
There were four datasets used in our experiments, i.e., C_Easy1, C_Easy2, C_Difficult1, and C_Difficult2. C_Easy1 contained eight noise levels ranging from 0.05 to 0.4 with a step of 0.05. The other three datasets contained four levels ranging from 0.05 to 0.2. "Easy" and "difficult" characterize the degree of overlap between spikes. Spikes of 64 sampling points were extracted from the continuous data by using the ground-truth spike times denoted in each dataset. These spikes could therefore be treated as labeled samples, making them suitable for testing our supervised deep learning method. The sampling rate of the simulated database was 24 kHz. Detailed information on the database is provided in Figure 2.
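Cutting fixed 64-sample windows around the ground-truth spike times can be sketched as below; the alignment offset (`pre`) and the function name are our own assumptions for illustration.

```python
import numpy as np

def extract_spikes(trace, spike_times, width=64, pre=20):
    """Cut width-sample windows from a continuous trace at the given
    ground-truth spike times (sample indices); windows that run off
    either end of the trace are dropped."""
    spikes = []
    for t in spike_times:
        start = t - pre
        if start >= 0 and start + width <= len(trace):
            spikes.append(trace[start:start + width])
    return np.array(spikes)

trace = np.zeros(24000)          # 1 s of data at 24 kHz
times = [100, 5000, 23990]       # the last one runs off the end
spikes = extract_spikes(trace, times)
```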
The experimental datasets were obtained from public databases (Collaborative Research in Computational Neuroscience, CRCNS) [48,49]. The data were recorded from the primary visual cortex (V1) of a macaque monkey with 64-site multi-electrode arrays (eight penetrations, eight sites per penetration). The number of electrode contacts ranged from 65 to 128. After the spikes were isolated via superparamagnetic clustering, manual resorting was conducted to avoid possible errors. The sampling rate of the experimental datasets was 24.4 kHz. The dataset provided spikes of single neurons, including their waveforms and firing times. Here, we removed the channels with very few spikes in at least one cluster (the number of spikes was less than 0.5% of the total), because there were not enough labeled samples to train the 1D-CNN. Figure 3 lists the remaining channels with detailed cluster information, which were used to test the performance of our deep learning method.

Training and Testing
Six experiments were performed for each dataset. The proportion of data used for the training and testing procedures is illustrated in Figure 4. For each dataset, the training set separately held 5%, 10%, 20%, 30%, 40%, and 50% of the total data, and the corresponding testing set held the remaining 95%, 90%, 80%, 70%, 60%, and 50%, respectively. We separately used 50% of the total data in C_E1_015 and C_D1_015 as the training set and the remaining data as the validation set to adjust and evaluate our model. When our model was determined, other datasets were used to evaluate the performance, and they were divided into a training set and testing set.
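The six train/test partitions can be sketched as follows; the random shuffling is our assumption, since the paper does not state how spikes were assigned to the two sets. With 3400 spikes, the 5% split yields 170 training spikes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_spikes = 3400                      # e.g., spikes in one dataset
idx = rng.permutation(n_spikes)      # shuffle once (our assumption)
splits = {}
for frac in (0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
    n_train = int(round(frac * n_spikes))
    # training set takes the first n_train indices; the rest is testing
    splits[frac] = (idx[:n_train], idx[n_train:])
```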
To evaluate the performance of the model, accuracy, calculated as the percentage of correctly classified samples over all data, was utilized to score the entire classification; it was also used in the analysis of the experimental data. In addition, considering the prevalent imbalance of the data distribution in the experimental data, the macro-average F-measure (Macro_F) was also employed to describe the classification effects. Macro_F is more sensitive to the classification quality, and it is defined as follows [50]:

Macro_F = (1/n) Σ_{i=1}^{n} F_i, with F_i = 2 P_i R_i / (P_i + R_i),

where n is the number of classes, and P_i and R_i denote the precision and recall of the i-th class, respectively.
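The Macro_F score can be computed directly from per-class precision and recall; a minimal sketch (our own code, following the standard macro-averaged F-measure):

```python
import numpy as np

def macro_f(y_true, y_pred, n_classes):
    """Macro-average F-measure: per-class F-scores are averaged without
    weighting, so small clusters count as much as large ones."""
    f_scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0  # precision of class c
        r = tp / (tp + fn) if tp + fn else 0.0  # recall of class c
        f_scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(f_scores))

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])   # imbalanced toy labels
y_pred = np.array([0, 0, 0, 0, 1, 0, 2, 2])   # one class-1 spike missed
score = macro_f(y_true, y_pred, 3)
```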

Simulated Database
In this section, the classification accuracy of the proposed 1D-CNN model is compared with "WMsorting", the deep-learning-based MLP model, and four traditional methods using the "Wave_Clus" database. The four traditional methods utilized PCA-based and correlation-based (CORR) feature extraction with FCM as the clustering method, along with two other published methods (fusion + SVM and FSDE + K-means). Details of these traditional methods can be found in [44]. The results for the data with different noise levels are shown in Tables 1 and 2.
Although "WMsorting" showed relatively high accuracy on the C_Easy1, C_Easy2, and C_Difficult2 datasets, it was not robust. The accuracy of "WMsorting" on C_Difficult1_020 was 85.38% and 86.45% when the feature dimensions were 3 and 10, respectively, much lower than its results on the other datasets. The feature dimension is an important parameter in the sorting evaluation, and it was determined with reference to the best Micro_F (micro-average F-measure) [51] score of the PCA-based method. In contrast, the accuracy of our proposed deep learning method on C_Difficult1_020 was above 98%, except for experiment 1, whose accuracy was 95.16%. That is, even when we used only 5% of the spikes to train the 1D-CNN, we still obtained an accuracy improvement of 8.71%. In fact, when the number of labeled spikes increased, i.e., in the other five experiments, our method enhanced the clustering accuracy by more than 10% compared to "WMsorting". On the other datasets, although the enhancement was not as significant, our deep learning method still performed better than "WMsorting". Additionally, the improvement is shown in the last column of Table 1, indicating that the best accuracy of our proposed model was higher than the best accuracy of "WMsorting" on all datasets.
Compared to traditional methods, our proposed deep learning method had more obvious advantages. Traditional methods worsened with increasing noise level for all datasets, with an error rate as high as 46%, while our error rate was around 2%. For lower noise levels, our accuracy was still better than that of traditional methods.
The deep-learning-based MLP model used 10% of the data nearest to the cluster centers as the training data; thus, we compared it with experiment 2 (10% of the total data as the training set) in the last column of Table 2. As can be seen, the deep-learning-based MLP model did not perform well at higher noise levels, especially on the C_Difficult2_020 and C_Difficult2_015 datasets. Its accuracy on C_Difficult2_020 was 51.55%, which was 42.8% lower than ours.
As the number of spikes used in the training set gradually increased (from 5% to 50% of the total), the accuracy exhibited a slight tendency to increase (0.607% on average). It was also shown that 5% of the data (170 spikes) was enough to train the proposed 1D-CNN model in most cases. On the other hand, with the increase in noise level (from 0.05 to 0.2), the accuracy showed no significant reduction. Both observations confirm that our proposed deep learning method is robust.
Furthermore, we took experiment 6 for two datasets (C_Easy1_noise015 and C_Difficult1_noise015) as an example to illustrate the results in more detail. First, we observed the accuracy of individual classes by constructing confusion matrices, as shown in Figure 5a,b. Each row in the confusion matrix represents a true label, while each column represents a predicted label. The bold percentages denote the proportion of correct true labels for each cluster. Clearly, there was only a small difference between the accuracy of individual classes and the overall accuracy, which is consistent with the balanced spike distribution of each cluster in the "Wave_Clus" database. That is, the proposed 1D-CNN model performed excellently for different clusters, indicating that it is an accurate and robust method for spike sorting.
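The row/column convention used here can be reproduced with a few lines of code (a hypothetical three-cluster example, not the Figure 5 data):

```python
import numpy as np

def confusion(y_true, y_pred, n):
    """Confusion matrix: rows are true labels, columns are predicted
    labels, matching the convention described for Figure 5."""
    m = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
m = confusion(y_true, y_pred, 3)
# Diagonal entries are correct classifications; off-diagonal are errors.
```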
Moreover, the accuracy for the C_Difficult1_noise015 dataset was clearly lower than that for C_Easy1_noise015. To explain this, we plotted the correctly clustered spikes in Figure 6 and the falsely classified spikes in Figure 7. Obviously, the spike waveform shapes of the three clusters in C_Easy1_noise015 were quite different and relatively easy to distinguish. On the other hand, the spike waveform shapes in C_Difficult1_noise015 were similar to each other and difficult to distinguish.

Figure 7 shows all the falsely classified spikes and a small subset of the correctly classified spikes in the two datasets. As illustrated in Figure 7a,b, only one spike in "Cluster 2" and three spikes in "Cluster 3" in C_Easy1_noise015 were falsely classified due to their very similar shapes. The spikes in "Cluster 1" in C_Easy1_noise015 were all correctly classified. On the other hand, there were more falsely classified spikes in C_Difficult1_015. Clearly, they were severely distorted by the noise and had similar shapes, which hindered their sorting into the corresponding clusters. However, compared to the accuracy of 91.18% obtained using the "WMsorting" method, our proposed deep learning approach showed a significant improvement, with an accuracy of 99.42%.

In addition to clustering accuracy, the loss function is an evaluation index for the performance of deep learning methods. Cross-entropy is a frequently used function that measures the degree of difference between the actual outputs and expected outputs of a CNN model. In other words, a smaller cross-entropy loss denotes the two outputs being closer. On the other hand, the loss function helps determine whether a model is overfitting or underfitting the data by comparing the training loss and testing loss. The training and testing cross-entropy losses on both datasets (experiment 6 for C_Easy1_noise015 and C_Difficult1_noise015) are plotted in Figure 8. In our proposed model, the training epoch was set to 50. Obviously, although there were occasional rises, the training loss and testing loss decreased along with the increase in the number of epochs. The testing loss of C_Easy1_015 decreased to 0.01708 at the 50th epoch, and the testing loss of C_Difficult1_015 was 0.05861 at the 50th epoch. These values were small enough to demonstrate the robustness of our proposed model. Furthermore, the small difference between the training loss and testing loss showed that our model neither overfitted nor underfitted the data.
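Cross-entropy between the softmax outputs and the one-hot labels can be sketched as follows (our own illustration); a confident correct prediction yields a smaller loss than a confident wrong one.

```python
import numpy as np

def cross_entropy(p_pred, y_true_onehot):
    """Cross-entropy between predicted probabilities and the true
    labels; smaller values mean the two distributions are closer."""
    eps = 1e-12                      # guard against log(0)
    return -np.sum(y_true_onehot * np.log(p_pred + eps))

target = np.array([0.0, 1.0, 0.0])   # true cluster is class 2 (one-hot)
good = np.array([0.05, 0.90, 0.05])  # confident, correct prediction
bad = np.array([0.60, 0.20, 0.20])   # confident, wrong prediction
```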

Experimental Datasets
Similarly, six experiments with different proportions of training data were performed on the extracellular recording spikes to further demonstrate the effectiveness of our method. In Figure 9, the classification accuracies are presented according to the number of spike clusters (ranging from two to five). It can be seen that most of the results achieved accuracies higher than 96.5% regardless of the number of labels used to train the 1D-CNN. In fact, the average accuracy of all experiments was 96.53%, which is an acceptable level. However, it should also be noted that the accuracies of some channels were significantly lower than those of others. For example, in the 94th channel, the classification accuracy of experiment 1 (5% of the data used as the training set) was 89.40%. The reason for this is that the number of spikes in the training set was 218, which was not enough to effectively train the 1D-CNN model. When the number of training spikes increased, the classification accuracy improved, e.g., it was 91.80% in experiment 2 (10% of the data used as the training set) and 94.60% in experiment 3 (20% of the data used as the training set). The above accuracies of each dataset were calculated as a function of the total spikes. As is known, there is always an imbalanced data distribution in experimental datasets; thus, Macro_F is an appropriate criterion to evaluate the classification quality. The results of this evaluation are given in Table 3. Taking the 98th channel as an example, the accuracy in experiment 1 was 95.44%, but the corresponding Macro_F score was 74.66%. Clearly, there was a great difference. The reason is that the minimum number of spikes in the three clusters was 103 and the corresponding number of spikes involved in the training set in experiment 1 was only 21. Obviously, this was not enough to adequately train the 1D-CNN to achieve a desirable output. 
It should also be noted that the Macro_F score increased to >90% when the number of spikes in the training set was >30% of the total number of spikes (i.e., experiments 4, 5, and 6). In fact, most Macro_F scores were above 95% in experiment 6. In the 84th and 116th channels, scores of 99% were achieved. Thus, as long as there are enough spikes for training the proposed 1D-CNN, it is very likely to obtain a close approximation of the ideal spike sorting results.
The experimental data were also analyzed using "WMsorting", and the Macro_F scores are shown in Table 3. We separately carried out two experiments using "WMsorting" with three and ten feature dimensions. We used the best accuracy obtained in each channel from our 1D-CNN model and "WMsorting" to calculate the improvement. As can be seen, our 1D-CNN model provided the better result.

Discussion
In this paper, we designed a 1D-CNN model to improve the accuracy and robustness of spike sorting. The novel deep learning method exhibited a high sorting accuracy in the simulated database, even when the data suffered from severe noise. In the experimental datasets, our proposed model still performed well in classifying the spikes recorded by the extracellular electrodes in the primary visual cortex of a macaque monkey. Thus, it is reasonable to conclude that our proposed model showed a good performance in terms of accuracy and robustness.
There were two factors influencing the performance of the method, i.e., sampling points and labels. First, the number of sampling points in each spike had a significant impact on the accuracy. As shown in Table 1, the accuracies of C_Easy1_015 and C_Difficult1_015 in experiment 6 were 99.77% and 99.42%, respectively. However, when we downsampled the spike waveforms from 64 points to 16 points, the corresponding accuracies were reduced to 99.14% and 87.91%. Clearly, the reduction seen for the C_Difficult1_015 dataset was much greater than that seen for C_Easy1_015. This is because the spike waveforms of different clusters in C_Difficult1_015 were more similar to each other. The lower accuracies of the experimental datasets with 12 sampling points also verified this effect. That is, with fewer spike sampling points, it was more difficult for the 1D-CNN to cluster the data correctly. A possible reason is that the 1D-CNN could not "learn" enough features from the limited spike sampling points. Secondly, as with other deep learning approaches [15,52,53], the number of labels in the training set was of great importance to the clustering accuracy. From the previous analysis of the simulated data, it was found that the accuracy could reach over 99.5% when there were more than 60 training spikes of each cluster available in the "easy" datasets. Moreover, the 1D-CNN needed at least 120 spikes of each cluster to guarantee over 98.4% accuracy in the "difficult" datasets. As far as the real data were concerned, considering the fewer sampling points and severe noise, more than 200 training spikes of each cluster were needed to obtain an accuracy of 95%. Thus, there is no doubt that the accuracy increases with the number of training spikes. When considering both factors, a sufficient number of spikes with enough sampling points in the training set can reliably guarantee high accuracy and strong robustness.
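The 64-to-16-point downsampling experiment can be reproduced schematically; keeping every fourth sample is our assumption about the decimation step, as the paper does not specify the method.

```python
import numpy as np

# A toy 64-point spike waveform (one sine cycle as a stand-in shape).
spike64 = np.sin(np.linspace(0, 2 * np.pi, 64))

# Downsample 64 -> 16 points by keeping every 4th sample (simple
# decimation; an anti-aliasing filter would normally precede this).
spike16 = spike64[::4]
```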
On the other hand, it is time-consuming and laborious to obtain a large number of training spikes. Therefore, it will be necessary to find a tradeoff between performance and cost in real applications.
Although "WMsorting" is a semi-supervised algorithm [54], whereas the method presented here is fully supervised, the degree of manual interventions was almost the same. There were about 170 spikes in the training set for experiment 1 using our proposed method and 180 spikes using "WMsorting." Hence, the workload when using our method with respect to labeling is equivalent to that when using a semi-supervised algorithm. Meanwhile, the accuracy of our proposed deep learning method was improved with respect to "WMsorting," especially for the "difficult" datasets.
Compared to the traditional methods and the deep-learning-based MLP model [45], our proposed 1D-CNN model had more obvious advantages in terms of robustness. With increasing noise levels, the traditional methods worsened to an error rate as high as 46.28%, and the deep-learning-based MLP model to an error rate as high as 48.45%, while our error rate remained around 2% for all datasets.
We used the Anaconda software (4.4.10, Anaconda Company, Austin, USA) and implemented the proposed CNN architecture using Keras in Python 3. When run on a dataset with 1738 training spikes and 1739 testing spikes using an Intel(R) Core(TM) i7-7820X central processing unit (CPU) equipped with an Nvidia GeForce RTX 2070 graphics processing unit (GPU) and 16 GB of random-access memory (RAM), the model took approximately 17 s for training and 2 s for testing. This means that our proposed deep learning method could classify a spike in approximately one millisecond. If the performance of the hardware were further improved, the computation time would be even shorter. Thus, it is feasible to apply our 1D-CNN method to online spike sorting.

Conclusions
We proposed a novel deep learning method based on a 1D-CNN, which can be used in online or offline automatic spike sorting. In future works, we will improve this CNN model to effectively cluster overlapping spikes and expand it to simultaneously classify multichannel recordings. As is well known, the shapes of overlapping spikes can appear with various and complicated changes, whereby the superposition time and the unit waveform are always different. This complicates spike sorting. In addition, high-density multielectrode arrays have been used to record the firing activities of large ensembles of neurons. Thus, a spike waveform may be recorded by multiple electrodes at the same time. How to distinguish spikes simultaneously recorded by different electrodes then becomes another important problem to be solved in spike sorting.