An Analysis of Deep Learning Models in SSVEP-Based BCI: A Survey

The brain–computer interface (BCI), which provides a new way for humans to directly communicate with robots without the involvement of the peripheral nervous system, has recently attracted much attention. Among all the BCI paradigms, BCIs based on steady-state visual evoked potentials (SSVEPs) have the highest information transfer rate (ITR) and the shortest training time. Meanwhile, deep learning has provided an effective and feasible solution for solving complex classification problems in many fields, and many researchers have started to apply deep learning to classify SSVEP signals. However, the designs of deep learning models vary drastically. There are many hyper-parameters that influence the performance of the model in an unpredictable way. This study surveyed 31 deep learning models (2011–2023) that were used to classify SSVEP signals and analyzed their design aspects including model input, model structure, performance measure, etc. Most of the studies that were surveyed in this paper were published in 2021 and 2022. This survey is an up-to-date design guide for researchers who are interested in using deep learning models to classify SSVEP signals.


Introduction
The brain-computer interface (BCI) provides a direct communication channel between the human brain and computers without using peripheral nerves or muscles [1]. BCIs allow users to harness their brain states for controlling devices such as spelling interfaces [2,3], wheelchairs [4,5], computer games [6,7], or other assistive devices [8,9]. Among all BCIs, electroencephalography (EEG)-based BCIs are the most widely used. EEG is a non-invasive way of acquiring brain signals from the surface of the human scalp and is widely adopted in brain-computer interface applications because of its safety, convenience, and high temporal resolution [10][11][12]. There are multiple commonly used paradigms to evoke brain signals to generate the control commands for EEG-based BCIs, including P300 [13], motor imagery [14], and steady-state visual evoked potential (SSVEP) [15].
Among them, SSVEP has the advantages of less training, high classification accuracy, and a high information transfer rate (ITR) [16] and is considered the most suitable paradigm for effective high-throughput BCI [17]. SSVEP is an oscillatory electrical potential elicited in the brain when the subject watches a stimulus flickering at a frequency of 6 Hz or above. A reorganization of spontaneous intrinsic brain oscillations in response to the stimulus will likely take place [18]. SSVEP signals are most evident in the occipital region (visual cortex), with the fundamental frequency matching that of the stimulus and its harmonics [19].
SSVEP-based BCIs generally consist of five main processing stages: the data collection stage that records neural data; the signal preprocessing stage that preprocesses and cleans the recorded data; the feature extraction stage that extracts meaningful information from the neural data; the classification stage that determines the output of the BCI from the processed neural data mostly using machine learning methods; and the feedback stage that presents the output of the BCI to the user [20].
Compared with other classification methods based on SSVEP, deep learning has many advantages. It integrates feature extraction and classification as a single process; therefore, deep learning is more likely to acquire subtle patterns that are not observable by humans but are informative for the classification of EEG signals. Deep learning utilizes a neural network consisting of several stacked layers of neurons, with each layer trained on a distinct set of features depending on the output of previous layers. As the data flow through the network, more complex features are obtained. The network can take raw SSVEP signals as the input, without the requirement for hand-crafted feature extraction as well as common signal preprocessing steps [21,22]. This property provides a critical advantage, as it precludes implicit EEG signals or features from being lost during preprocessing or feature extraction [23].
However, the design of deep learning models varies significantly, and it is hard to predict the performance of the model by its structure. The preprocessing of data, the number of neurons, the number of layers, the choice of activation functions, the choice of training methods, and the adoption of pooling layers or the dropout technique to prevent overfitting all impact the performance of the model, and thus surveying successful deep learning models and learning from their structures is of great significance for the designing of future deep learning models.

Related Surveys
The reviews and surveys on using deep learning models to classify SSVEP from 2019 to 2023 are summarized in Table 1. As shown in Table 1, most of the surveys did not cover the detailed deep structures or hyperparameters of the deep learning models, which are critical references for designing future deep learning models. Only Craik's work covered these two areas in 2019; however, with the fast advancement of deep learning techniques, it is necessary to gather recent research results to offer up-to-date information for current researchers. This survey provides detailed deep learning model analysis which includes details of structures and hyperparameters for 31 deep learning models, most of which were published in 2021 and 2022. This survey is an up-to-date survey aimed at providing design details and design analysis for future deep learning models in SSVEP classification. Table 1. Reviews and surveys on using deep learning models to classify steady-state visual evoked potentials (SSVEPs). Here, EEG represents electroencephalography, fNIRS represents functional near-infrared spectroscopy, MEG represents magnetoencephalography, FE represents feature extraction, ML represents machine learning, and DL represents deep learning.

Literature Search and Inclusion Criteria
To conduct this survey, the following databases were used: PubMed, Engineering Village, ScienceDirect, IEEE Xplore, and Google Scholar. Papers were selected for survey if the following keywords appeared in their title: (1) SSVEP and (2) deep learning, RNN, CNN, DNN, or LSTM.
After further reading, papers that satisfied the following criteria remained in this survey: (1) written in English; (2) had innovations in the structural design of deep learning models; (3) had detailed information regarding model input, structure, and performance (or at least 70% of the details revealed); and (4) the deep learning model was designed to classify SSVEP signals. After selection, 31 articles remained in this survey, and the 31 deep learning models were dissected and analyzed in detail in the following content.

Quality Assessment
The Cochrane collaboration tool was used to assess the quality of the selected articles [34]. For the 31 articles included in this survey, they were classified into having (a) a low risk of bias, (b) a high risk of bias, or (c) an unclear risk of bias in six domains. The quality of the articles was categorized into weak (fewer than three low-risk domains), fair (three to five low-risk domains), or good (six low-risk domains). Of the 31 articles included in this survey, 4 of them were categorized as weak, 19 were categorized as fair, and 8 were categorized as good. The results are shown in Figure 1.


Contribution of This Survey
Unlike hand-crafted approaches such as feature extraction methods or machine learning methods, where mathematical analysis can help predict the performance of models, the performance of deep learning models is rather unpredictable, and the design process often includes a trial-and-error approach to validate the design choice of structure or hyperparameters. Thus, when using deep learning models as the classification method for SSVEP classification, an overview of the detailed structures and hyperparameters of other successful deep learning models can significantly facilitate the design process, which is one of this survey's key advantages.
To the best of our knowledge, this survey is the first since 2019 aimed at dissecting the deep learning models used for SSVEP signal classification from different aspects and providing a thorough design guide for future deep learning models targeting the classification of SSVEP signals. To this end, 31 deep learning models for SSVEP classification are dissected and analyzed in detail, most of which were published in 2021 and 2022. Three key contributions are made in this survey:
• Key elements of deep learning models are introduced to help readers gain a comprehensive understanding of deep learning models;
• Design details of 31 deep learning models are listed to provide information and handy references for the design of future deep learning models;
• Design considerations of deep learning models are analyzed and discussed, which can benefit: (1) researchers with a computer background who are interested in SSVEP-based BCI; (2) neuroscience experts who intend to construct deep learning models to classify SSVEP signals.
In sum, this survey provides a thorough and convenient guide for the future design of deep learning models for SSVEP classification.

Organization of This Survey
The rest of this survey is structured as follows: Section 2 introduces the model input and three frequently used open datasets of SSVEP signals, as well as the data preprocessing methods; Section 3 overviews model structure designs, including DNN models, long short-term memory (LSTM) models, CNN models and their components such as pooling layers, dropout, training methods, and activation functions; Section 4 discusses the design considerations and performance measures of models; Section 5 points out the current challenges and future directions; and Section 6 provides the concluding remarks.

Model Input
The quantity of training data has a crucial impact on the performance of deep learning models. The more complex a deep learning model, the more data it requires in training; otherwise, its performance will not surpass a simpler deep learning model or traditional machine learning approaches [21]. In BCI research, the quantity of data can be measured by the SSVEP signal length per channel. The data lengths of the 31 deep learning studies are analyzed as references for researchers who want to collect their own data, and three frequently used public datasets are presented for researchers who are unable to collect SSVEP data themselves. The preprocessing methods of input data are also introduced, as preprocessing can make the features of the data easier to extract, thus improving model performance.

Data Length
Deep learning models require training of their parameters, which usually demands a large amount of data; performance generally improves as more data are used. Additionally, the more complex a deep learning model is, and the more parameters it has, the more data it requires for training, otherwise its performance will not surpass simpler deep learning models, feature extraction methods such as canonical correlation analysis (CCA), or machine learning methods such as the support vector machine (SVM) [21]. However, recording EEG data from participants takes effort; thus, the size of the experimental dataset is limited. Here, the time length of the SSVEP signal in each channel in the 31 deep learning studies is overviewed in Figure 2A to provide a guide to SSVEP signal length for researchers who want to prepare their own data for training deep learning models.
The more complex a deep learning model is, the more data it needs for training. If insufficient data are used for training the deep learning model, the model will learn slight variations and noise in the training data, which are exclusive to that database and do not reflect the features of the target signal. This is known as overfitting and will harm the model's performance in testing while using data other than the training dataset [21]. As Figure 2A shows, for comparatively complex deep learning models, an SSVEP signal length between 40,000 s and 50,000 s may provide enough data to train the model if researchers want to collect their own data and use a model of a similar size to those covered in this survey. For relatively simple deep learning models, an SSVEP signal length below 10,000 s may be enough for training. Detailed data length and data point calculations are given in Table 2.
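As a quick illustration of these quantities, the sketch below computes the per-channel signal length and total sample count for a hypothetical protocol shaped like the Nakanishi dataset described later (10 subjects, 15 blocks of 12 four-second trials, 8 channels at 256 Hz); the numbers are ours, for illustration only.

```python
# Estimate per-channel SSVEP signal length and total sample count for a
# recording protocol, as a rough guide when sizing a training dataset.
# The numbers follow a hypothetical protocol shaped like Nakanishi 2015:
# 10 subjects, 15 blocks, 12 trials per block, 4 s per trial, 256 Hz, 8 channels.

def signal_length_seconds(subjects, blocks, trials_per_block, trial_len_s):
    """Total SSVEP signal length per channel, in seconds."""
    return subjects * blocks * trials_per_block * trial_len_s

def total_samples(length_s, fs, n_channels):
    """Total number of recorded data points across all channels."""
    return int(length_s * fs * n_channels)

length_s = signal_length_seconds(10, 15, 12, 4)        # 7200 s per channel
samples = total_samples(length_s, fs=256, n_channels=8)
print(length_s, samples)  # 7200 s per channel, 14,745,600 samples in total
```

Comparing such an estimate against the 10,000–50,000 s ranges above gives a first indication of whether a planned recording session will support a model of a given complexity.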

Three Frequently Used Open Datasets
Recording SSVEP signals takes effort, and many researchers choose to use open datasets to save time in obtaining EEG data and to train and validate their models. By using open datasets, it is also easier to compare methods, because many other researchers have published results based on the same dataset. Here, three open datasets frequently used in SSVEP deep learning research are summarized.

Nakanishi Open Dataset
Nakanishi published their open dataset in 2015, making it the earliest and most frequently used SSVEP open dataset in deep learning research targeting SSVEP analysis [35]. In Nakanishi's study, ten healthy subjects participated in the experiment. For each subject, the experiment consisted of 15 blocks, and each block contained 12 trials corresponding to the 12 targets; in each trial, subjects were asked to gaze at one stimulus. The stimuli flickered for 4 s on the monitor after a 1 s break for subjects to shift their gaze. The EEG data epochs were sampled at a sampling rate of 2048 Hz with eight electrodes and later down-sampled to 256 Hz. All data were bandpass filtered from 6 Hz to 80 Hz with an infinite impulse response (IIR) filter. Considering the latency delay in the visual system, all data epochs were extracted with a 0.135 s delay after the stimulus onset. The Nakanishi 2015 open dataset can be obtained from https://github.com/NeuroTechX/moabb (accessed on 10 March 2023).
Table 2. A detailed analysis of the structures of the 31 deep learning models used for SSVEP analysis. NM stands for not mentioned. C is the channel number of the dataset. T is the length of the segment data. GD stands for gradient descent. SGD stands for stochastic gradient descent. ReLU stands for rectified linear unit. Other abbreviations are unique to the original papers, so please refer to the references.
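The epoch extraction with a 0.135 s visual latency described above can be sketched as follows; the continuous recording and onset times below are synthetic, and only the shapes (8 channels, 256 Hz after down-sampling, 4 s epochs) follow the dataset description.

```python
import numpy as np

# Sketch of epoch extraction with the 0.135 s visual-latency delay described
# for the Nakanishi 2015 dataset: each epoch starts 0.135 s after stimulus
# onset. The recording here is synthetic noise; only the shapes follow the
# dataset (8 channels, 256 Hz, 4 s of stimulation per trial).
fs = 256
latency_s, epoch_s = 0.135, 4.0
recording = np.random.randn(8, 60 * fs)     # 8 channels, 60 s of fake EEG
stimulus_onsets_s = [5.0, 10.0, 15.0]       # hypothetical onset times

def extract_epoch(data, onset_s):
    start = int(round((onset_s + latency_s) * fs))
    return data[:, start:start + int(epoch_s * fs)]

epochs = np.stack([extract_epoch(recording, t) for t in stimulus_onsets_s])
print(epochs.shape)  # (3, 8, 1024)
```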

Wang Open Dataset
Wang presented an open dataset which included a large number of subjects (8 experienced and 27 naïve, 35 in total) in 2017 [48]. For each subject, the experiment included 6 blocks, each containing 40 trials corresponding to the 40 stimuli. The visual stimuli flickered for 5 s after a 0.5 s target cue, and there was a 0.5 s rest time before the next trial began. The EEG data epochs were recorded at a sampling rate of 1000 Hz and later down-sampled to 250 Hz.

Data Preprocessing
Data preprocessing can enhance the performance of the model by making the features easier to extract. Common techniques, including frequency filters, time-frequency transforms, and filter banks, are often implemented in SSVEP analysis using deep learning.

Frequency Filters
By applying frequency filters, noise can be removed from the data. Many open datasets consist of already filtered data using frequency filters including bandpass filters and notch filters. In Nakanishi's open dataset, a bandpass filter from 6 Hz to 80 Hz was applied to remove low-frequency noise and high-frequency noise, as the stimulus frequencies between 9.25 Hz and 15.25 Hz together with their harmonics were included. In Wang's open dataset, a notch filter at 50 Hz was applied to remove the power-line noise in the recording. In the BETA open dataset, a bandpass filter from 0.15 Hz to 200 Hz and a notch filter at 50 Hz were applied. Many researchers apply frequency filters in their own datasets as well, as shown in Table 2.
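A minimal preprocessing chain combining the filters mentioned above might look like the following; the filter orders and the use of zero-phase filtering are our own illustrative choices, not taken from the cited datasets.

```python
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt

# Illustrative preprocessing chain: a 6-80 Hz bandpass (as in the Nakanishi
# dataset) followed by a 50 Hz notch (as in the Wang and BETA datasets).
# Filter orders and zero-phase (filtfilt) filtering are our own choices.
fs = 250
b_bp, a_bp = butter(4, [6, 80], btype="bandpass", fs=fs)
b_n, a_n = iirnotch(w0=50, Q=30, fs=fs)

t = np.arange(0, 2, 1 / fs)
# synthetic signal: 12 Hz SSVEP component + 1 Hz drift + 50 Hz line noise
x = (np.sin(2 * np.pi * 12 * t)
     + 2 * np.sin(2 * np.pi * 1 * t)
     + 0.5 * np.sin(2 * np.pi * 50 * t))

# bandpass removes the slow drift, notch removes the power-line component
x_clean = filtfilt(b_n, a_n, filtfilt(b_bp, a_bp, x))
```

After filtering, the 12 Hz stimulus component dominates the spectrum while the drift and line-noise components are strongly attenuated.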

Time-Frequency Transform
The implementation of a time-frequency transform can make frequency features easier for deep learning models to extract. When time-domain signals are used as the input, a more complex model is usually required to extract features, while neural networks with frequency-domain input data can have a relatively simpler structure. In SSVEP deep learning research, the Fast Fourier Transform (FFT) is the most widely used time-frequency transform. Kwak applied the FFT to the input data and transformed the time-domain input into 120 frequency samples across 8 channels [17]. Nguyen applied the FFT to single-channel data to reduce the computation time of the system and used it as the only input into a 1D CNN model for SSVEP classification [45]. Ravi applied the FFT to transform 1200 time-domain samples into 110 frequency components per data segment [49]. In these studies, the FFT also reduced the number of input data points, thus reducing the impact of overfitting, as training data were limited.
In some studies, FFT data were processed before feeding into the model to enhance model performance. In 2020, Ravi found that CNN models using complex spectrum features that were concatenated by the real part and the imaginary part of the complex FFT have higher accuracies than the same models using the magnitude spectrum of FFT as the input [51]. Dang took FFT as the input and intercepted the spectrum sequences of the fundamental waves and two harmonics and used them as parallel inputs into the CNN model to enhance the model's performance [27].
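The two FFT-based input representations discussed above can be sketched on a synthetic segment as follows; the magnitude spectrum and the concatenated real/imaginary "complex spectrum" features follow the general idea of Ravi et al. [51], with all sizes chosen for illustration.

```python
import numpy as np

# Two FFT-based input representations for an SSVEP segment: the magnitude
# spectrum, and "complex spectrum" features in the spirit of Ravi et al. [51],
# i.e., the real and imaginary parts of the FFT concatenated into one vector.
fs = 256
t = np.arange(0, 1, 1 / fs)
segment = np.sin(2 * np.pi * 10 * t)        # synthetic 10 Hz SSVEP segment

spectrum = np.fft.rfft(segment)             # 129 complex bins for 256 samples
magnitude_features = np.abs(spectrum)
complex_features = np.concatenate([spectrum.real, spectrum.imag])

peak_hz = np.fft.rfftfreq(len(segment), 1 / fs)[np.argmax(magnitude_features)]
print(peak_hz)  # 10.0
```

Note that both representations are shorter than a multi-second time-domain segment, which is why FFT inputs can also reduce overfitting when training data are scarce.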

Filter Bank
Filter bank analysis performs sub-band decompositions with multiple filters that have different pass-bands. In 2015, Chen proposed a filter bank canonical correlation analysis (FBCCA) that incorporates fundamental and harmonic frequency components together to enhance the detection of SSVEP. By adding a filter bank to CCA analysis, FBCCA significantly outperformed CCA [28], which proved the filter bank to be an efficient data preprocessing method. Recently, researchers found that filter bank analysis can be implemented to process the inputs of deep learning models as well.
Ding built and compared two CNN models in 2021, one with a filter bank and one without a filter bank. Ding found that by adding a filter bank analysis to the input of the CNN model, the classification accuracy displayed a 5.53% increase in his own dataset on average, and a 5.95% increase in a public dataset [58]. In 2022, Pan leveraged four filter banks ranging from 8×m to 80 Hz for the input data before inserting them into a CNN-LSTM network, where m ∈ {1,2,3,4} [60]. Chen also implemented three filter banks to enhance a transformer-based model's performance, which was named FB-SSVEPformer. Chen also found that, compared to using two or four filter banks, using three filter banks provided the best performance [64]. Yao built three filter banks and then fed the input to three EEGNets used as sub-networks separately before merging the features together [73]. These studies showed that a filter bank is an effective tool to process the SSVEP input and make frequency features easier to extract by the deep learning models.
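A filter bank along the lines of Pan's sub-bands (8·m to 80 Hz, m ∈ {1, 2, 3, 4}) could be sketched as below; the filter order and zero-phase filtering are our own illustrative choices.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Sketch of filter-bank preprocessing in the spirit of Pan et al. [60]:
# four sub-bands with pass-bands 8*m to 80 Hz for m in {1, 2, 3, 4}.
# Filter order and zero-phase filtering are illustrative choices only.
fs = 250

def filter_bank(x, fs, n_bands=4):
    """Decompose a 1-D signal into sub-bands, shape (n_bands, len(x))."""
    bands = []
    for m in range(1, n_bands + 1):
        b, a = butter(4, [8 * m, 80], btype="bandpass", fs=fs)
        bands.append(filtfilt(b, a, x))
    return np.stack(bands)

t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 12 * t)          # synthetic 12 Hz SSVEP component
sub_bands = filter_bank(x, fs)          # stacked sub-band inputs for a model
print(sub_bands.shape)  # (4, 500)
```

A 12 Hz component passes the first sub-band (8–80 Hz) almost unchanged but is strongly attenuated in the fourth (32–80 Hz), which is how the decomposition separates fundamental and harmonic energy for the downstream network.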

Model Structure
Frequently used deep learning models can be generally categorized into three categories: fully connected neural networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs) [76]. CNNs have convolution layers in the network, meaning they have fewer connections than fully connected neural networks and generally need less computation power [77]. CNNs are also generally less prone to overfitting than fully connected neural networks when training data are limited. RNNs differ from fully connected neural networks and CNNs in that RNNs have memory of the input: they apply the same function to every input element, while the output for the current input depends on previous computations. Long short-term memory (LSTM) is one kind of RNN and has been used in SSVEP signal classification.

Artificial Neural Networks (ANNs)
ANNs are also known as feed-forward neural networks, as information flows through the network in only one direction, with no turning back. ANNs have the advantage of generalized classification or prediction ability; with a properly designed structure, an ANN's performance improves with additional training data. The disadvantage of ANNs is that they take copious amounts of data to train, which may not be viable in data-insufficient areas such as BCIs. Additionally, there is no specific rule for the structural design of ANNs, which makes the design process resemble trial and error and makes it time consuming.
Of the 31 studies that used deep learning models to analyze SSVEP signals, only one study used an ANN model. In 2016, Kwak built three models to classify SSVEP signals: a CNN-1 with two convolutional hidden layers, a CNN-2 with two convolutional hidden layers and one fully connected hidden layer, and a fully connected neural network (referred to as a DNN) with two fully connected hidden layers. Kwak found that CNN-1 outperformed the other two, and the DNN's performance was the worst of the three. Kwak deduced that CNN-1 outperformed all other methods because of its low complexity and simple structure, which was effective in his training-data-insufficient condition [17].

Recurrent Neural Networks (RNNs)
RNNs are known for their capability of storing temporary memory in the network; they have advantages in processing sequential information, especially in language translation, speech recognition, etc. The disadvantage of RNNs is that they are generally hard to train, both in time and in complexity.
LSTM is a type of RNN that has higher memory power and is thus able to learn long-term dependencies. Kobayashi first applied LSTM in 2019 to decode SSVEP signals in controlling drones and achieved an accuracy of 96.8%, which was significantly better than using FFT combined with machine learning methods such as decision tree (DT), support vector machine (SVM), linear discriminant analysis (LDA), and k-nearest neighbor nonparametric regression (k-NN) [46]. In 2022, Pan merged LSTM and a CNN together in his LSTM-CNN model and achieved the highest classification accuracies in two datasets. In Pan's LSTM-CNN model, a BiLSTM module was added after a one-dimensional convolution module used for temporal filtering [60]. Zhang proposed a bidirectional Siamese correlation analysis (bi-SiamCA) model that used two LSTM layers to extract features of the EEG signal and reference signal and then analyzed their correlation before feeding to a convolution layer [71].
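To make the gating mechanism behind LSTM's "memory" concrete, the following is a generic single-cell LSTM forward pass in NumPy; it is a textbook cell, not the architecture of any surveyed model, and the dimensions (8 input channels, 16 hidden units) are arbitrary.

```python
import numpy as np

# Minimal single-step LSTM cell: input, forget, and output gates control how
# the cell state (the "memory") is updated and exposed at each time step.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0 * H:1 * H])      # input gate
    f = sigmoid(z[1 * H:2 * H])      # forget gate
    o = sigmoid(z[2 * H:3 * H])      # output gate
    g = np.tanh(z[3 * H:4 * H])      # candidate cell state
    c_new = f * c + i * g            # memory update: keep some, add some
    h_new = o * np.tanh(c_new)       # hidden state exposed to the next layer
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 8, 16                          # e.g., 8 EEG channels, 16 hidden units
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(100, D)):   # 100 time samples of fake EEG
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (16,)
```

The forget gate `f` is what lets the cell retain information across many time steps, which is the "long-term dependency" advantage cited above.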

Convolutional Neural Networks (CNNs)
CNNs are deep learning models that use convolutional layers and, in most cases, pooling layers as well. The convolutional layers extract features through convolutional kernels, and pooling layers increase the observation field of the hidden layers. The advantage of CNNs is their weight-sharing mechanism, which keeps their computation cost low compared to other deep learning models, and they can detect features without human intervention. The disadvantage of CNNs is that they usually require large amounts of data, and a complex CNN structure demands high computational power to train.
Nearly all of the studies included in this survey used convolutional layers in their models, which shows the effectiveness of convolutional layers in SSVEP analysis. One possible explanation for their popularity is that convolutional layers take advantage of the local spatial coherence of SSVEP signals in the time or frequency domain, allowing the model to have fewer weights and to be more easily trained on an SSVEP dataset. The structural designs of the CNN models can be seen in Table 2.

Number of Convolutional Layers
Some studies suggest that CNN models with more convolutional layers have better performance. Aznan found that although the shallow model with one convolutional layer worked well for subject S01, with an accuracy of 96 ± 2%, when the model was applied to subject S04, whose EEG data were absent from the training dataset, the classification accuracy dropped to 59%. However, by changing the convolutional layer number from one to five, the classification accuracy of subject S04 increased to 69%, which suggested that perhaps a deeper model is required to perform the inter-subject SSVEP classification [41]. Podmore built a CNN model with five hidden layers and achieved 86% offline accuracy of classification. In his experiment, his model was better than FBCCA when using only data from three channels, but worse than FBCCA when more channels were used. Additionally, Podmore demonstrated that his model had better performance than 1DSCU, which is a CNN model with only one hidden layer [47]. Zhao applied a CNN model with five hidden layers on the classification of AR-SSVEP and found it to be significantly more accurate than ensemble-TRCA, CCA, and FBCCA [65].
Some studies suggest that CNN models with fewer convolutional layers have better performance. Kwak implemented two kinds of CNN and one DNN neural network on the decoding of SSVEP signals. Kwak found that CNN-1, a convolutional neural network with two hidden layers, outperformed CNN-2, which had three hidden layers, and DNN, suggesting that more CNN layers may not be good for the model [17]. Based on these studies, it is observed that the models' performance is influenced by the number of convolutional layers, but there is not a linear relationship between the number of convolutional layers and the performance of the model.

Size of CNN Kernels
The kernels are used to convolve on the previous layer's output. A smaller kernel tends to collect more local information, while a larger kernel tends to collect more global information. In SSVEP analysis, most of the models use one-dimensional convolution, and thus the kernel size is 1×N. Here, the kernel sizes of CNN models are summarized in Figure 2B, and the details of the kernel sizes are shown in Table 2. From Figure 2B, it can be observed that small one-dimensional kernels of sizes below 1 × 25 are preferred in these studies.
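The trade-off a 1 × N kernel makes can be seen directly from the length of a "valid" convolution output, T − N + 1; the sketch below uses arbitrary sizes for illustration.

```python
import numpy as np

# Effect of 1-D kernel size on a "valid" convolution: a 1 x N kernel sliding
# over T samples produces T - N + 1 outputs, so a larger kernel widens the
# receptive field but shortens the feature map. Sizes are illustrative only.
T = 256                                   # samples in one channel of a segment
signal = np.random.randn(T)

for N in (3, 11, 25):                     # small kernels, as favored in Figure 2B
    kernel = np.random.randn(N)
    feature_map = np.convolve(signal, kernel, mode="valid")
    print(N, feature_map.shape[0])        # 3 -> 254, 11 -> 246, 25 -> 232
```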

Pooling Layer and Dropout
Both pooling layers and dropout can reduce the computation cost of the model and overfitting. A max pooling layer or an average pooling layer is often used after a convolution layer; they help to reduce the spatial size of the convolved features as well as overfitting by providing an abstracted representation of the features. Dropout works by randomly zeroing some of the connections in the network, which reduces overfitting and part of the computation cost. Deep learning models often adopt pooling layers and dropout, especially CNN models and LSTM models. Here, the implementation of pooling layers and dropout in deep learning models is summarized in Figure 3. From Figure 3, it can be observed that since 2021 more studies have chosen to use both pooling layers and the dropout technique to minimize overfitting in their models.
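Both mechanisms are simple enough to sketch in a few lines of NumPy; the pooling size, dropout rate, and inverted-dropout scaling below are common defaults, not taken from any surveyed model.

```python
import numpy as np

# Toy max pooling and dropout, to show how each shrinks computation and
# fights overfitting (illustrative defaults, not from any surveyed model).
def max_pool_1d(x, size=2):
    """Non-overlapping 1-D max pooling; halves the feature length for size=2."""
    return x[: len(x) // size * size].reshape(-1, size).max(axis=1)

def dropout(x, p=0.5, rng=None):
    """Randomly zero activations with probability p (inverted-dropout scaling)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)          # rescale so expected value is unchanged

features = np.arange(8, dtype=float)     # [0, 1, ..., 7]
print(max_pool_1d(features))             # [1. 3. 5. 7.]
print(dropout(features, p=0.5).shape)    # (8,)
```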

Training Method
Gradient descent (GD) is the earliest method in deep learning to minimize the loss function of the model and optimize the model's weights. The disadvantage of GD is that it can easily become trapped at local minima or saddle points instead of reaching the global minimum, and thus stop optimizing the model.
Stochastic gradient descent with momentum (SGD) implements an exponential moving average (EMA) to accumulate previous weight changes, giving it a better chance of escaping local minima and saddle points. Its disadvantage is that it cannot adjust the step size, so it may oscillate between slopes instead of descending deeper toward a local minimum. To solve this problem, root mean square propagation (RMSProp) adapts the step size to avoid bouncing between ridges and to move towards the minimum.
Adam, short for adaptive moment estimation, combines the heuristics of both RMSProp and momentum and is widely regarded as an effective default optimizer for deep learning models. From Figure 3, it can be observed that Adam is the most frequently used training method in deep learning research.
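For concreteness, here is a sketch of a single Adam update on one scalar parameter, following the standard formulation (the learning rate and decay constants are the common defaults; the quadratic toy objective is our own example, not from any surveyed study):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # EMA of gradients (momentum part)
    v = b2 * v + (1 - b2) * grad ** 2   # EMA of squared gradients (RMSProp part)
    m_hat = m / (1 - b1 ** t)           # bias correction for the warm-up phase
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimizing the toy loss f(w) = w^2, whose gradient is 2w, from w = 1.0:
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
```

The momentum term smooths the descent direction, while the second-moment term normalizes the step size per parameter, which is why Adam inherits the strengths of both predecessors.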

Activation Function
An activation function is used in the neuron of the deep learning model to add nonlinearity to the model and allows the model to abstract non-linear features from the input data. There are various types of activation functions, and they all have their advantages and disadvantages.
A sigmoid function mimics a probability value and gives a normalized output, which is easy to interpret and is often used in shallow networks. Its disadvantage is that it can cause the vanishing gradient problem, and its exponential calculation is slow for computers. Tanh provides stronger gradients than the sigmoid function, and its zero-centered output facilitates back-propagation; however, like the sigmoid function, it suffers from the vanishing gradient problem. The rectified linear unit (ReLU) makes the computation easier, can significantly improve the training speed of a deep learning model, and does not have the vanishing gradient problem. Its disadvantage is that when the input is negative, ReLU is inactive and can therefore generate dead neurons. The Gaussian error linear unit (GELU), proposed in 2016, is very similar to ReLU but was validated to provide improvements over ReLU across computer vision, natural language processing, and speech tasks [78]; its disadvantage is a more complex computation. The parametric rectified linear unit (PReLU) is an improved version of ReLU with a small slope for negative values, which prevents the dying ReLU problem in which a neuron is stuck on the negative side and keeps outputting zero. PReLU is one of the most advanced activation functions in deep learning and appeared only once in our survey, in a paper published in 2022. The softmax function is used as the output layer in almost every multivariate classification deep learning model, as it turns the output vector into a vector of positive numbers between 0 and 1 that sum to 1, so the output can be interpreted as probabilities.
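The functions compared above can be written down directly. This is a sketch using the standard textbook definitions; GELU uses its common tanh approximation, and PReLU's negative slope, which is normally learned during training, is fixed here for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def prelu(x, a=0.25):        # small slope a for x < 0 avoids dead neurons
    return x if x >= 0 else a * x

def gelu(x):                 # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def softmax(xs):
    """Output layer: positive values summing to 1, read as class probabilities."""
    exps = [math.exp(x - max(xs)) for x in xs]  # shift by max for stability
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # one probability per stimulus class
```

Note the max-shift inside `softmax`: it changes nothing mathematically but prevents overflow for large logits, which is the standard numerically stable form.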
The choice of activation function is largely a design choice, except for the output layer. The implementation of activation functions other than the softmax function is summarized in Figure 4, from which it can be observed that ReLU is the most frequently implemented activation function.

Table 2 (PReLU stands for parametric rectified linear unit, GELU stands for Gaussian error linear unit, ELU stands for exponential linear unit, No means an activation function is not used in the model).

Discussion
As Table 2 shows, deep learning models are increasingly being employed to classify SSVEP signals with the progressive advancement of deep learning techniques. For a deep learning method that can successfully classify an SSVEP signal, the process of design and the performance measures are crucial.

Design of Model
The design of deep learning models for classifying SSVEP signals involves obtaining SSVEP datasets, designing models, and enhancing model performance. In Section 2, the data length of 31 SSVEP deep learning studies is analyzed, and three commonly used open datasets are provided. This provides information for researchers who are unsure of the data length they should use to test their deep learning model.
Some researchers choose small datasets for their training and then use data augmentation to expand the dataset. Kobayashi's dataset contained only 400 s of SSVEP signal data. To expand the training data, Kobayashi split the 20 s of data into 923 segments with a 0.0195 s shift and a 2 s length. This expanded the 20 s of data into 1846 s of data for the model and allowed the model to be well trained [46]. Other data augmentation techniques such as SpecAugment have also been used to augment EEG data. SpecAugment was used initially in speech recognition and turned out to be effective in expanding SSVEP data [53].
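The sliding-window scheme described for Kobayashi's data can be sketched as follows. We work directly in seconds; the exact count at the boundary depends on rounding, so the helper below may differ from the reported 923 segments by one:

```python
def count_segments(total_s, window_s, shift_s):
    """Number of windows of length window_s, stepped by shift_s, in total_s."""
    return int((total_s - window_s) / shift_s) + 1

# 20 s of data, 2 s windows, 0.0195 s shift -> roughly the 923 segments
# reported in [46]; each segment contributes 2 s of training data.
n = count_segments(20.0, 2.0, 0.0195)
expanded_s = n * 2.0   # total seconds of (overlapping) training data
```

The heavy overlap is the point: the same 20 s recording yields on the order of a thousand training examples, at the cost of strong correlation between neighboring segments.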
Choosing an open dataset rather than self-collected data may be a better choice because many researchers have already published their models' performance based on open datasets, making it convenient to compare results. Additionally, this saves a lot of time in collecting data. Many researchers choose to use self-collected data together with public datasets or even multiple open datasets to validate their models' performance objectively, as shown in Table 2.
For the structural design of the deep learning model, CNN models are currently the most widely used, and they will most likely continue to perform well in future studies. CNN models' weight sharing and reduced computation make them easier to train and efficient at extracting spatial features from data, especially when an FFT is performed on the input data. However, the design of a CNN model involves choosing many hyperparameters, such as the number of convolutional layers, the size of the kernels, the activation function, and the implementation of pooling layers or dropout.
In this survey, detailed structures of 26 uniquely designed CNN models are shown in Table 2, which provides information for researchers who want to design their own CNN models. Additionally, a general structure of a CNN model can be observed: an input layer consisting of channels × time points or FFT data points; two to three convolutional layers with pooling layers or dropout; a fully connected layer between the last convolutional layer and output layer; and an output layer which contains the same number of neurons as the number of stimuli.
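The generic structure above can be checked at the level of feature-map sizes alone. The layer and kernel sizes below are invented examples for illustration, not taken from any surveyed model:

```python
def conv1d_out_len(n, kernel, stride=1):
    """Output length of a valid 1-D convolution."""
    return (n - kernel) // stride + 1

def pool1d_out_len(n, size):
    """Output length of non-overlapping pooling."""
    return n // size

time_points = 250                       # e.g. 1 s of EEG at an assumed 250 Hz
t = conv1d_out_len(time_points, 10)     # conv layer 1, 1 x 10 kernel
t = pool1d_out_len(t, 2)                # pooling halves the length
t = conv1d_out_len(t, 10)               # conv layer 2
t = pool1d_out_len(t, 2)                # pooling again
n_stimuli = 8                           # output layer: one neuron per stimulus
```

Tracking shapes this way is a quick sanity check that the fully connected layer between the last convolutional layer and the output layer receives the size you expect.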
In the choice of an activation function, a popular option is to use the ReLU function in hidden layers and the softmax function in the output layer. However, with the advent of GELU and PReLU, which have proven to be better substitutes for ReLU, these two functions should be considered promising alternatives.
When tuning hyperparameters for the model, optimization algorithms can also be used. Bhuvanesshwari proposed an automated hyperparameter optimization technique using the red fox optimization algorithm (RFO), compared its results with those of four other optimization algorithms, and found that a five-layer CNN with RFO-optimized hyperparameters achieved the highest classification accuracy of 88.91% [68].
For the training of deep learning models, Adam combines the advantages of RMSProp and momentum and is generally the best choice. This can be observed in the widespread use of Adam since 2018, as shown in Table 2.
Other than designing CNN models from scratch, some researchers choose to modify existing CNN models from computer vision to classify the SSVEP signal. Avci converted SSVEP signals into spectrograms and routed them to the GoogLeNet deep learning model for binary classification [66]. Paula encoded EEG data as images using time-series imaging techniques and then used four 2D-kernel-based CNNs from the computer vision field, including ResNet, GoogLeNet, DenseNet, and AlexNet, to classify SSVEP signals [75].
EEGNet, a compact convolutional neural network initially designed for classifying multiple BCI paradigms including the P300 visual-evoked potential, error-related negativity responses (ERN), movement-related cortical potentials (MRCPs), and sensorimotor rhythms (SMRs), has also been widely utilized as a basic module in CNN models. In Yao's research, three EEGNets were used as sub-networks in the CNN model [73]. Likewise, Li modified EEGNet and applied transfer learning to initially train the model parameters [63]. Zhu applied an ensemble learning strategy, combining multiple EEGNet models with different kernel numbers to enhance the classification accuracy of ear-EEG signals from 50.61% to 81.12% at a 1 s window length, and also demonstrated that the accuracy of the average ensemble model surpasses that of any single EEGNet model with a different kernel number [54]. These studies show that EEGNet is an effective building block in CNN model design.
Schirrmeister showed that convolutional neural network design choices substantially affect decoding accuracies; in particular, the implementation of batch normalization and dropout significantly increases accuracy, which also shows that recent advances in deep learning methods improve model performance [22]. Thus, adding newly developed deep learning techniques may be an effective way of enhancing model performance.
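As a reference point for the batch normalization credited above with improving accuracy, here is a toy version over a 1-D batch of activations (frameworks additionally learn a per-feature scale and shift, omitted here):

```python
import math

def batch_norm(xs, eps=1e-5):
    """Standardize a batch of activations to roughly zero mean, unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

normed = batch_norm([1.0, 2.0, 3.0, 4.0])  # centered around zero
```

Normalizing each layer's inputs this way keeps activations in a well-conditioned range, which stabilizes and speeds up training; the small `eps` guards against division by zero for constant batches.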

Performance Measure of the Model
Most of the studies chose accuracy as the measure of their model's performance. The formula for calculating accuracy is shown below:

P = N_correct / N_total

where P is the prediction accuracy, N_correct is the total number of correct predictions in the experiment, and N_total is the total number of predictions in the experiment. Comparing a newly developed model's accuracy with existing methods' accuracies on the same dataset is valid, but comparing accuracy values across studies based on different datasets is not, because classification accuracy also depends on the number of stimuli: more stimuli means a lower probability of choosing the right stimulus by chance.
The accuracy also depends on whether the detection is inter-subject or intra-subject. Intra-subject detection is also known as user-dependent (UD) detection: the model is trained on data from a single participant and validated on the same participant. Inter-subject detection is also known as user-independent (UI) detection: the model is trained on data from multiple participants and validated on data from a novel, unseen user. Ravi demonstrated that UD training methods consistently outperformed UI methods when all other conditions were the same [51].
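The user-independent protocol is typically realized as leave-one-subject-out cross-validation, which can be sketched as follows (the subject IDs are placeholders):

```python
def leave_one_subject_out(subjects):
    """Yield (train_subjects, test_subject) pairs for UI evaluation."""
    for i, test in enumerate(subjects):
        train = subjects[:i] + subjects[i + 1:]
        yield train, test

folds = list(leave_one_subject_out(["S1", "S2", "S3"]))
# each fold trains on two subjects and tests on the held-out one
```

In the UD setting, by contrast, the split is made within a single subject's trials, so the model never has to generalize across people, which is why UD accuracies are consistently higher.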
Another commonly used metric is the information transfer rate (ITR), which measures the communication speed and quality of the BCI system. ITR is calculated by the following formula, with units of bits/min:

ITR = (60 / T) × [log2(M) + P log2(P) + (1 − P) log2((1 − P) / (M − 1))]

where P is the prediction accuracy between 0 and 1, M is the number of stimuli, and T is the stimulation duration in seconds. For the same model, P can be improved by using more sampled data points in each classification, but this lengthens T and lowers the transfer rate; shortening T limits the data available for each classification, causing P to fall. There is thus a trade-off in the optimization of ITR.
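The ITR formula and its accuracy/time trade-off can be made concrete with a small helper (the P, M, and T values below are invented examples, not results from any surveyed study):

```python
import math

def itr_bits_per_min(p, m, t):
    """ITR in bits/min: p = accuracy in (0, 1], m = stimuli, t = seconds/selection."""
    if p >= 1.0:
        bits = math.log2(m)  # perfect accuracy: log2(M) bits per selection
    else:
        bits = (math.log2(m) + p * math.log2(p)
                + (1 - p) * math.log2((1 - p) / (m - 1)))
    return 60.0 / t * bits

# Trade-off: a longer window T usually raises P but divides the rate by T.
fast = itr_bits_per_min(0.80, 8, 1.0)  # short window, lower accuracy
slow = itr_bits_per_min(0.95, 8, 4.0)  # long window, higher accuracy
```

With these example numbers the short-window system wins despite its lower accuracy, illustrating why reporting accuracy alone can be misleading for BCI speed.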

Limitations of This Survey
This survey aims at analyzing deep learning models used in SSVEP classification; it does not cover deep learning models used for other brain signals such as P300 and motor imagery (MI). Although these brain signals are different from SSVEP, the deep learning models applied to them may be instructive for designing deep learning models for SSVEP classification. Additionally, recent advancements in other fields, such as computer vision and natural language processing, can aid the deep learning models used for SSVEP classification; these are not included in this survey.

Opening Challenges and Future Directions
Most of the studies on SSVEP signal classification are based on multi-channel data, as more data often means more potential features for the deep learning model. However, in real applications, wearing multi-channel EEG amplifiers is inconvenient and expensive.
EEG devices using fewer electrodes ultimately translate to lower (1) hardware costs, (2) hygiene risks, and (3) user discomfort [79]. Research on using one or a few channels of SSVEP data is therefore meaningful. In 2022, Macias modified capsule neural networks (CapsNet) to classify SSVEP signals and achieved a classification accuracy of 98.02% on his own dataset using a single active channel [70].
Ear channels are also a good substitute for scalp channels in SSVEP signal detection in terms of convenience, unobtrusiveness, and mobility. In 2022, Israsena proposed a CNN structure with two convolutional layers to classify SSVEP signals from one scalp channel and two ear channels, T7 and T8, achieving 79.03% accuracy with a 5 s window from Oz and around 40% accuracy from T7 or T8 [69]. These accuracies are not high, and there is still room for improvement in detecting SSVEP signals from ear channels.
Most deep learning models use one-dimensional CNN kernels to extract spatial features of SSVEP signals in the time or frequency domain, treating SSVEP signals as one-dimensional data recorded in multiple channels. This prevents the direct application of deep learning models from the computer vision field. However, in 2022, Avci demonstrated that by converting SSVEP signals into spectrograms, deep learning models from computer vision can be applied to SSVEP signal classification [66]. Avci's work is inspirational; hopefully, in the future, by converting SSVEP signals into two-dimensional image data, more computer vision models can be applied to SSVEP classification and demonstrated to be effective.
In 2015, the filter bank was proposed and used with canonical correlation analysis (CCA) to improve its performance in classifying SSVEP signals. Multiple studies in this survey implemented filter banks to improve their deep learning models' performance. Other data preprocessing techniques can also be applied to deep learning models to enhance their performance; future research can be conducted to study these techniques.
In summary, here are three future directions that researchers should pay attention to:
1. Using deep learning models to enhance the performance of SSVEP classification based on data from fewer channels, single channels, or ear channels, to improve the practicality of SSVEP-based BCIs;
2. Implementing the latest deep learning models or techniques for the classification of SSVEP signals;
3. Trying different data preprocessing techniques to enhance deep learning models' performance.

Conclusions
In this survey, 31 deep learning models in SSVEP-based BCI were examined in detail and analyzed. There are three key aspects to consider in the design of deep learning models, including the model input, model structure, and model performance measures. In the model input section, the data length is analyzed to provide a reference for the amount of data needed to train deep learning models in SSVEP classification. Then, three frequently used open datasets are presented. Frequently used data preprocessing methods using deep learning models including filters, FFT, and filter banks are also introduced. In the model structure section, different structures of deep learning models are analyzed as well as their basic components, such as activation function, kernel size, layer number, and training method. This provides information for the structural design of deep learning models. In the discussion section, the design and performance measures of deep learning models are discussed. In Section 5, current challenges and future directions are pointed out. More importantly, the design details of 31 deep learning models are summarized in Table 2 to offer a convenient and comprehensive reference for designing future deep learning models.

Conflicts of Interest:
The authors declare no conflict of interest.