Dataset Evaluation Method and Application for Performance Testing of SSVEP-BCI Decoding Algorithm

Steady-state visual evoked potential (SSVEP)-based brain–computer interface (BCI) systems have been extensively researched over the past two decades, and multiple standard datasets have been published and widely used. However, these datasets differ in sample distribution and collection equipment, and the field lacks a unified evaluation method. Most new SSVEP decoding algorithms are tested on self-collected data or verified offline on one or two earlier datasets, which can lead to performance differences when the algorithms are used in actual application scenarios. To address these issues, this paper proposed an SSVEP dataset evaluation method and analyzed six datasets with frequency and phase modulation paradigms to form an SSVEP algorithm evaluation dataset system. Finally, based on these datasets, performance tests were carried out on four existing SSVEP decoding algorithms. The findings reveal that the performance of the same algorithm varies significantly when tested on different datasets, and that performance varies substantially between the best-performing and worst-performing subjects. These results demonstrate that the SSVEP dataset evaluation method can integrate six datasets into an SSVEP algorithm performance testing dataset system. This system can test and verify SSVEP decoding algorithms from different perspectives, such as different subjects, environments, and equipment, which is helpful for the research of new SSVEP decoding algorithms and has significant reference value for other BCI application fields.


Introduction
A brain–computer interface is a direct interaction, communication, and control system established between the brain and external devices without relying on peripheral nerves and muscle tissue [1][2][3]. The SSVEP paradigm is a classic BCI paradigm that has been extensively studied for over 20 years [4][5][6][7] owing to its high signal-to-noise ratio, stable response, and high information transfer rate (ITR). Decoding algorithms are a crucial component of BCI research: they process and analyze brain signals and convert them into instructions that external devices can understand. Data for algorithm research are typically self-collected by researchers recruiting subjects or obtained from public datasets published in the field. Thus, the publication of public datasets with standardized collection specifications and detailed descriptions is crucial for effectively verifying algorithm performance and promoting iterative algorithm progress. In recent years, several classic datasets in the SSVEP-BCI field have been released [8][9][10][11][12][13] and have been widely applied and verified, effectively promoting the development of the field.
However, the majority of current algorithm studies use only one or two datasets to verify performance [14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29]. This practice does not make full use of public data resources, and the results are limited by the sample distribution of the individual datasets, making it difficult to judge from them how an algorithm will perform in real application scenarios. This issue has two underlying causes. First, existing datasets in the field lack unified organization and evaluation, and some have yet to gain widespread attention. Second, the field lacks unified analysis and evaluation methods that span multiple datasets.
To solve the above problems, this paper sorted out six frequency and phase modulation paradigm datasets [8][9][10][11][12][30] that are publicly available in the SSVEP-BCI field. It further proposed an SSVEP dataset evaluation method and analyzed the six datasets, forming a dataset system for algorithm performance testing that covers different sample population distributions, devices, and acquisition environments (as shown in Figure 1). Based on this dataset system, comprehensive testing of SSVEP-BCI decoding algorithm performance was conducted, ultimately achieving an efficient simulation of the actual application effect of a decoding algorithm and avoiding the significant gap between real BCI system performance and data validation results that arises from validation on an isolated dataset.

Datasets
Current research on decoding algorithms in the BCI field is usually tied to a specific paradigm. To ensure horizontal comparability between datasets and decoding algorithms, this paper selected the most influential paradigm in the SSVEP field, the frequency and phase modulation paradigm [31] (existing public SSVEP datasets mainly adopt this paradigm), and then sorted out all six currently obtainable public datasets of this paradigm, as shown in Table 1.
SSVEP benchmark dataset (dataset1). The SSVEP benchmark dataset was collected and published by Wang et al. [8] and contains EEG data from 35 subjects performing a 40-target frequency and phase modulation SSVEP typing task. Each stimulus target includes six trials, with each trial lasting 5 s. For more detailed information, please refer to reference [8]. This dataset will be referred to as dataset1 in this paper.
SSVEP BETA dataset (dataset2). The SSVEP BETA dataset [9] serves as an extension of the SSVEP benchmark dataset, with a different environment and subject group. Liu et al. [9] collected and released this dataset, which includes 70 participants and four online experiments. Each experiment consists of 40 trials; each trial begins with a 0.5 s cue period, followed by the flicker task period, and ends with a 0.5 s rest period. For the first 15 participants (S1-S15), the flicker task period lasts at least 2 s, while for the remaining 55 participants (S16-S70), it lasts at least 3 s. For more detailed information about dataset2, please refer to reference [9]. This dataset will be referred to as dataset2 in this paper.
SSVEP Wearable dataset (dataset3). The SSVEP wearable dataset [10] is currently the largest and most standardized wearable BCI dataset available within the SSVEP-BCI field. It contains experimental data from 102 participants, which was collected and released by Zhu et al. [10].
This dataset includes both dry-electrode and wet-electrode recordings, with 10 blocks of data collected per electrode type. Each block consists of 10 trials per stimulus target, with a total duration of 2.84 s per trial, comprising a 0.5 s cue period, a 2 s task period, an additional 0.14 s visual latency, and a 0.2 s visual aftereffect period. For more detailed information on this dataset, please refer to the original publication [10]. Since this dataset includes data from both dry and wet electrodes, we will refer to them as dataset3a (dry-electrode data) and dataset3b (wet-electrode data), respectively, in the following sections.
SSVEP elder dataset (dataset4). The SSVEP elder dataset [11] contains experimental data from 100 elderly participants and was collected and released by Liu et al. [11]. This dataset employs a nine-target SSVEP paradigm and is divided into six blocks. Each block consists of one trial for each stimulus target, with a total duration of 6 s per trial, including a 0.5 s pre-stimulus period, 5 s of stimulus presentation, and a 0.5 s post-stimulus period. For more detailed information on this dataset, please refer to the original publication [11]. This dataset will be referred to as dataset4 in this paper.
SSVEP FBCCA-DW dataset (dataset5). The FBCCA-DW dataset was collected by Yang et al. [30], including 14 subjects performing a 40-target SSVEP paradigm (same as dataset1). The experiment was divided into four blocks, with each block consisting of one trial for each stimulus target, and each trial lasting for 3 s. For more detailed information on this dataset, please refer to the original publication [30]. This dataset will be referred to as dataset5 in this paper.
SSVEP UCSD dataset (dataset6). The UCSD dataset was collected and published by Nakanishi et al. [12] and includes data from 10 healthy subjects. The dataset includes 12 stimulus targets, divided into 15 blocks, with each block containing one trial of each target. Each trial had a stimulus duration of 4 s. A detailed introduction to the dataset can be found in reference [12]. This dataset will be referred to as dataset6 in this paper.

Decoding Algorithm
This study tested four popular decoding algorithms in the SSVEP-BCI field: canonical correlation analysis (CCA) [4], filter bank canonical correlation analysis (FBCCA) [32], ensemble task-related component analysis (eTRCA) [7], and task discriminant component analysis (TDCA) [33]. CCA, FBCCA, and eTRCA are all considered significant milestones in the development of SSVEP-BCI decoding algorithms and have inspired many subsequent improvements. Since this study proposes a dataset system for evaluating SSVEP decoding algorithms, it focuses on these three most influential algorithms. Furthermore, since all three were proposed several years ago (the most recent, eTRCA, in 2017), this study also included TDCA, a recently proposed improvement on eTRCA, for validation. TDCA is likewise a widely recognized training-based decoding algorithm in the field.
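At its core, CCA-based SSVEP recognition correlates the multichannel EEG with sine-cosine reference templates built at each candidate stimulation frequency and picks the frequency with the largest canonical correlation. The sketch below illustrates this idea; the function names, harmonic count, and QR/SVD formulation are illustrative choices, not the exact implementations used in this study.

```python
import numpy as np

def make_reference(freq, n_harmonics, fs, n_samples):
    # Sine-cosine reference signals for one stimulation frequency
    t = np.arange(n_samples) / fs
    ref = []
    for h in range(1, n_harmonics + 1):
        ref.append(np.sin(2 * np.pi * h * freq * t))
        ref.append(np.cos(2 * np.pi * h * freq * t))
    return np.column_stack(ref)          # (n_samples, 2 * n_harmonics)

def max_canonical_corr(X, Y):
    # Largest canonical correlation via QR + SVD of the orthonormal bases
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    return float(np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0])

def cca_classify(eeg, stim_freqs, fs, n_harmonics=5):
    # eeg: (n_samples, n_channels); returns the index of the best-matching frequency
    scores = [max_canonical_corr(eeg, make_reference(f, n_harmonics, fs, eeg.shape[0]))
              for f in stim_freqs]
    return int(np.argmax(scores))
```

On a synthetic 10 Hz response, `cca_classify` selects the 10 Hz candidate because only that reference set shares a subspace with the signal.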
Among the four algorithms, CCA and FBCCA are non-training algorithms, while eTRCA and TDCA are training algorithms. In this study, eTRCA and TDCA were trained on data consisting of six trials for each task target (stimulation frequency), with cross-validation used for trial selection. A filter bank consisting of five filters was used, and its design and weight curves were kept consistent with reference [7]. The number of components in the TDCA algorithm was set to nine after tuning (increasing it further did not improve performance but only added complexity). Another important TDCA parameter is Np, the length of the sliding window. Typically the data window length used for calculation is greater than 1 s, so Np can range from 0.1 s to 1 s (in steps of 0.1 s). In this study, when the data window length was less than 1 s, Np was capped at the maximum value supported by that window length; for instance, for a 0.5 s window, Np took values from 0.1 s to 0.5 s (in steps of 0.1 s). For more detailed information on the principles of the four decoding algorithms, please refer to the respective references.
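In the FBCCA literature, the n-th sub-band is typically weighted by w(n) = n^(-a) + b and the squared sub-band correlations are summed into one feature per candidate frequency. The sketch below assumes the commonly cited constants a = 1.25 and b = 0.25; the exact weight curve used in this study follows the cited reference and may differ.

```python
import numpy as np

def fb_weights(n_bands=5, a=1.25, b=0.25):
    # Sub-band weights w(n) = n^(-a) + b: lower sub-bands contribute more
    n = np.arange(1, n_bands + 1)
    return n ** (-a) + b

def fbcca_feature(subband_rhos, weights):
    # FBCCA target feature: weighted sum of squared sub-band CCA correlations
    return float(np.sum(np.asarray(weights) * np.asarray(subband_rhos) ** 2))
```

Each sub-band is usually a band-pass filtered copy of the EEG (e.g., sub-band n covering roughly 8n Hz up to the upper analysis limit); the weighted feature is then compared across candidate frequencies.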

Evaluation Indexes
The core feature of SSVEP is the EEG frequency response at the stimulation frequency [34]. The frequency signal-to-noise ratio is therefore the most direct indicator of the strength of the SSVEP-induced response, and it can be analyzed in two ways: narrow-band SNR and wide-band SNR. In addition, recognition accuracy, optimal response time, and ITR are core indicators with broad consensus for evaluating SSVEP-BCI algorithm performance [7,32]. The six datasets analyzed in this study have varying paradigm coding parameters, collection equipment, and test groups; consequently, the results of these five indexes typically vary significantly among samples [9]. In summary, this paper proposed five indexes for data evaluation: narrow-band signal-to-noise ratio (SNR), wide-band SNR, recognition accuracy, optimal response time, and information transfer rate (ITR).
In addition, since different algorithms yield different results for recognition accuracy, optimal response time, and ITR, this study needed to select a widely accepted and influential decoding algorithm as the standard algorithm. The FBCCA algorithm [32] is a milestone non-training decoding algorithm in the SSVEP-BCI field: it first introduced the concept of filter bank analysis and greatly improved the effectiveness of SSVEP decoding. FBCCA is therefore widely used in BCI system development and is considered a highly influential and stable algorithm. Thus, FBCCA was used as the standard algorithm for data analysis in this study.
The following section provides the specific definitions and calculation methods for the five indexes:
(1) Narrow-band SNR. The narrow-band SNR is defined as the ratio of the spectral amplitude at the stimulation frequency to that of its surrounding frequencies and is represented as SNRt. The calculation method is shown in Equation (1).
Here, y(f) refers to the signal amplitude spectrum calculated by fast Fourier transform, and df represents the spectral resolution. SNRt is defined as the ratio of the amplitude spectrum value of y(f) to that of the frequencies 1 Hz apart on either side, expressed in decibels (dB). During the calculation, EEG data from trials at the same frequency (usually 4-6 trials, depending on the dataset) were first averaged before calculating SNRt.
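As an illustration, one plausible implementation of this narrow-band SNR follows; since Equation (1) is not reproduced here, the treatment of the 1 Hz neighbourhood (mean of the bins within 1 Hz on either side) and the bin rounding are assumptions.

```python
import numpy as np

def narrowband_snr(x, fs, f_stim):
    # Narrow-band SNR (dB): amplitude at f_stim relative to the mean
    # amplitude of neighbouring bins within 1 Hz on either side.
    n = len(x)
    spec = np.abs(np.fft.rfft(x)) / n              # amplitude spectrum y(f)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    df = freqs[1] - freqs[0]                       # spectral resolution
    k = int(round(f_stim / df))                    # bin of the stimulation frequency
    side = int(round(1.0 / df))                    # number of bins in 1 Hz
    neighbours = np.r_[spec[k - side:k], spec[k + 1:k + side + 1]]
    return 20.0 * np.log10(spec[k] / neighbours.mean())
```

In practice, trials at the same stimulation frequency would be averaged before this function is applied, as described above.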
(2) Wide-band SNR. The wide-band SNR is defined as the ratio of the spectral amplitude of the stimulation frequency and its harmonic responses to the amplitude of the other frequencies in a wide-band range. This variable is represented as SNRw (as shown in Equation (2)).
Here, Nh denotes the number of harmonics, P(f) refers to the power spectrum at frequency f, and fs/2 represents the Nyquist frequency.
(3) Recognition accuracy of the standard algorithm. The accuracy of multi-target recognition under FBCCA is referred to as ACCstand. The data window length used by the FBCCA algorithm is 2 s; if the maximum window length of a dataset is less than 2 s, that maximum window length is used instead.

(4) Optimal response time of the standard algorithm
Response time is an important index of BCI performance and, for a specific paradigm and scenario, can be evaluated with a standard algorithm. In this paper, the optimal response time for multi-target recognition calculated by FBCCA is represented by Tbest (the time coordinate of the inflection point at which recognition accuracy exceeds 90%). If the maximum data window length does not reach an average recognition accuracy of 90%, the optimal response time is adjusted based on the accuracy obtained at the maximum data window length of the dataset. The specific calculation method is shown in Equation (3).
Here, Tmax represents the maximum data window length supported by the dataset, and ACCTmax represents the average recognition accuracy of the FBCCA algorithm at that window length.
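The first branch of this definition can be sketched as a simple scan over candidate window lengths; the Equation (3) fallback for datasets that never reach 90% accuracy is not reproduced here, so the sketch only signals when the fallback applies.

```python
def best_response_time(window_lengths, accuracies, threshold=0.9):
    # Return the first window length whose average accuracy reaches the
    # threshold; None means the caller must apply the Eq. (3) fallback
    # based on the dataset's maximum window length.
    for t, acc in zip(window_lengths, accuracies):
        if acc >= threshold:
            return t
    return None
```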
(5) Optimal information transfer rate of the standard algorithm. The information transfer rate (ITR) is an evaluation index that jointly considers recognition accuracy, the number of coding targets, and response time. The calculation of ITR is shown in Equation (4), where T represents the response window length, N the number of stimulation targets, and P the recognition accuracy. T is an important factor in the ITR calculation. When calculating ITR in this research, the response window length T was taken as the recognition window length plus 0.5 s; this additional 0.5 s accounts for the switching time when users shift their gaze between targets in actual use [7,31]. Recognition accuracy is the most critical measure of BCI system performance, as it determines system usability. Therefore, ITRbest in this research is defined as the highest ITR score obtained under a response window length T whose recognition rate exceeds 90%. If no time window achieves a recognition accuracy above 90%, ITRbest is calculated at the maximum time window length supported by the dataset.
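The ITR formula referenced here is the standard Wolpaw definition used throughout the SSVEP-BCI literature, ITR = (60/T)[log2 N + P log2 P + (1-P) log2((1-P)/(N-1))] in bits/min; the sketch below includes the 0.5 s gaze-shift time described above.

```python
import math

def itr_bits_per_min(n_targets, p, t_window, t_shift=0.5):
    # Wolpaw ITR in bits/min; T = recognition window + 0.5 s gaze-shift time.
    t = t_window + t_shift
    if p <= 1.0 / n_targets:
        return 0.0                      # at or below chance: no information
    bits = math.log2(n_targets)
    if p < 1.0:
        bits += p * math.log2(p) + (1 - p) * math.log2((1 - p) / (n_targets - 1))
    return 60.0 / t * bits
```

For a 40-target paradigm with perfect accuracy at a 1 s recognition window, this gives 40 * log2(40), roughly 213 bits/min.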

Index Scoring Method
For the five evaluation indexes (SNRt, SNRw, ACCstand, Tbest, ITRbest) introduced in the previous section, this study designed a percentage-based scoring method. The corresponding scoring calculations (score1-score5) are provided, and the total score is obtained by summing them. However, score1-score5 were not assigned 20 points each (an even split of the five indicators over 100 points) but were slightly adjusted according to the importance of each indicator. Specifically, the maximum scores for SNRw and Tbest were reduced to 15 points, while those for ACCstand and ITRbest were increased to 25 points, for the following reasons. Although SNRw measures the strength of the SSVEP response, wide-band noise can be reduced by frequency-domain or spatial filtering during algorithm development (whereas narrow-band noise is difficult to filter out); therefore score1 is unchanged, while score2 is slightly reduced. Although Tbest is a crucial indicator of BCI system performance, current evaluation of BCI systems focuses on accuracy and information transfer rate [7,18,35]; therefore score4 was reduced, while score3 and score5 were increased accordingly.
In addition, this study introduced the log10 function to adjust the score change curve, aiming to compress the parameter range of the lower scores appropriately so that more complex data under different conditions can obtain some scores (e.g., dataset3b). The subsequent section will outline the specific calculation methods of score1-score5.
(1) Score of SNRt. The SNRt distribution of the SSVEP response in the occipital region is typically in the range of −5 dB to 20 dB [9]; thus a corresponding data score, score1, can be assigned as shown in Equation (5).
The value 15.1 is a parameter that ensures score1 approximates 20 when SNRt reaches 10 dB; it has no physical significance. Similar parameter settings appear in Equations (6)-(9) below.
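Since Equations (5)-(9) are not reproduced in this excerpt, the following purely hypothetical mapping only illustrates the log10 compression idea (a score curve that rises quickly over the low end of the index range); the endpoints, cap, and constants all differ from the paper's actual equations.

```python
import math

def log_score(value, lo, hi, cap):
    # Illustrative log10-compressed score: 0 at `lo`, `cap` at `hi`.
    # The constant 9 makes log10 span exactly [0, 1] over the range;
    # the paper's tuned constants (e.g., 15.1) are not used here.
    x = min(max(value, lo), hi)
    frac = (x - lo) / (hi - lo)
    return cap * math.log10(1 + 9 * frac)
```

Because of the logarithm, a dataset sitting at the midpoint of the index range already earns well over half the cap, which is the stated goal of letting harder datasets (e.g., dataset3b) still obtain some score.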
(2) Score of SNRw. The SNRw distribution of the SSVEP response in the occipital region is typically in the range of −35 dB to 0 dB [9]; thus a corresponding data score, score2, can be assigned as shown in Equation (6).
(3) Score of ACCstand. Using the FBCCA algorithm [32], the accuracy of multi-target identification is referred to as ACCstand. In consideration of system performance and availability, a corresponding score, score3, can be assigned as shown in Equation (7).
(4) Score of Tbest. The calculation of score4 (the score of Tbest) is presented in Equation (8). If recognition accuracy reaches 90% within a 2 s data length, full marks are awarded. If Tbest falls between 2 and 8 s, a score is assigned based on the number of targets (C in Equation (8)) and Tbest. If Tbest exceeds 8 s, indicating low system availability, the score is zero.
(5) Score of ITRbest. When the FBCCA algorithm is used to compute the classic datasets in this paper, the obtained ITRbest falls within the range of 30 bits/min to 100 bits/min. Hence, the score calculation for ITRbest (score5) is given by Equation (9).
(6) Total score. This study established the data ranges of the relevant indexes and provided corresponding score assessments based on the analysis of SSVEP datasets. By summing the scores of the five indexes and referring to Table 2, the level of an SSVEP dataset can be determined. As Table 2 shows, the higher the final score, the lower the decoding difficulty of the dataset.

Algorithm Performance Evaluation Index
To evaluate algorithm performance, this study used recognition accuracy, best response time, and ITR as indices; their definitions and computation methods were introduced in the dataset evaluation method section. Table 3 presents the calculation results for the six datasets based on the five SSVEP dataset evaluation indexes and six scores (score1-score5 and the total score) proposed in this paper. Table 3 indicates that the six datasets differ in SNRt, SNRw, ACCstand, Tbest, and ITRbest. Dataset3a, which uses wearable SSVEP data collected with dry-electrode headbands, has the highest decoding difficulty, while dataset1, which strictly controls subject age, state, and collection environment, has the lowest decoding difficulty owing to its outstanding decoding performance. In summary, the six datasets establish a system for verifying algorithm performance with a graded distribution of decoding difficulty.

Algorithm Performance Testing
Comprehensive testing and verification of the performance of the CCA, FBCCA, eTRCA, and TDCA algorithms were conducted using the six datasets analyzed in this paper, with the results shown in Figure 2 and Tables 4 and 5. Figure 2 examines the performance of the same algorithm on different datasets, while Tables 4 and 5 examine the performance of different algorithms on the same dataset. [Table 5 lists the pairwise comparisons (CCA vs. FBCCA, CCA vs. eTRCA, CCA vs. TDCA, FBCCA vs. eTRCA, FBCCA vs. TDCA, eTRCA vs. TDCA) for each dataset, marked as p < **, p < ***, or N.S.] In Figure 2, the data lengths refer to the duration of the extracted EEG data starting from the onset of the stimulus trial, that is, the length of EEG data used for algorithm recognition. Figure 2 shows that the same algorithm exhibits different recognition accuracies on different datasets, and that these differences vary with the data window length. For instance, for the CCA algorithm, the recognition result for dataset1 is the lowest among the six datasets at a 0.4 s window length, but improves significantly once the window length is extended to 1 s or longer. This also indicates that the six datasets exhibit a graded distribution of decoding difficulty and allow algorithm performance to be comprehensively tested and evaluated across different subject populations, collection devices, and collection environments.
Using a window length of 1 s, pairwise comparisons between the methods were conducted using paired-sample t-tests. The corresponding p-values after Bonferroni correction are listed in Table 5.
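With four algorithms there are six pairwise comparisons per dataset, and the Bonferroni step multiplies each raw p-value by the number of comparisons (capped at 1) before testing against the significance level. A minimal sketch of that correction (the raw p-values themselves would come from the paired t-tests):

```python
def bonferroni(p_values, alpha=0.05):
    # Bonferroni correction for m comparisons: adjusted p = min(m * p, 1),
    # declared significant when the adjusted p is below alpha.
    m = len(p_values)
    adjusted = [min(p * m, 1.0) for p in p_values]
    return adjusted, [p_adj < alpha for p_adj in adjusted]
```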
Tables 4 and 5 demonstrate that the accuracies of the eTRCA and TDCA algorithms are higher than those of CCA and FBCCA, that TDCA outperforms eTRCA, and that FBCCA outperforms CCA. There are exceptions, however: on dataset3, which has high decoding difficulty, FBCCA and CCA perform similarly, and on dataset6, TDCA and eTRCA perform comparably. These results indicate that the relative performance of algorithms under different environmental and demographic conditions is not fixed.
Furthermore, after Bonferroni correction, there were no significant differences between the eTRCA and TDCA results. However, the uncorrected t-tests between the two algorithms showed significant differences on some datasets (dataset1: p < 0.05; dataset2: p < 0.001; dataset3a: p < 0.001; dataset3b: p < 0.001; dataset4: p < 0.001; dataset5: p = 0.09; dataset6: p = 0.67). Moreover, although the statistical results for dataset5 were non-significant after Bonferroni correction, the uncorrected t-tests between the four algorithms showed significant differences (except for the comparison between eTRCA and TDCA).
Additionally, the relative decoding performance of specific datasets is not fixed. Taking dataset1 as an example, its decoding performance with the non-training algorithms (CCA and FBCCA) is significantly worse than that of dataset4 and dataset5, whereas the opposite holds for the training algorithms (eTRCA and TDCA). These results suggest that different algorithms behave differently along the decoding difficulty gradient of the six datasets. The proposed decoding difficulty gradient may therefore facilitate comprehensive performance testing of different algorithms.

Result of Typical Subjects
There are large individual differences in brain responses across people. Therefore, when analyzing decoding algorithm performance, we should not only focus on the mean of the calculated results but also on the lowest and highest values. The highest value indicates the performance upper limit the algorithm can achieve: if the corresponding BCI system is applied in specific operational scenarios [36,37], selecting individuals with excellent performance from a large pool of operators can achieve the best system performance. The lowest value indicates the lower limit of algorithm performance, which should be improved in mass-usage scenarios [3,38,39] to enable a wider range of users. Table 6 shows the statistics of the best and worst subjects for the four algorithms on the six datasets under a 2 s data window length. It reveals a large difference in accuracy and ITR between the best and worst subjects for the same algorithm and dataset. Additionally, for the same dataset, the best and worst subjects are not entirely identical across algorithms. These results objectively reflect the individual variability of the SSVEP response. Furthermore, the worst-performing subjects did not reach an identification accuracy above 80% with the 2 s window (except for eTRCA and TDCA on dataset1 and dataset6), while the accuracy of the best-performing subjects always exceeded 90%. These results suggest that SSVEP-BCIs based on the above four algorithms require further optimization to improve adaptability to acquisition equipment and environments and to expand the applicable population.

Advantages of the Dataset System over Individual Datasets
The dataset system proposed in this paper is valuable because it can verify the performance of SSVEP-BCI decoding algorithms under different conditions, including different subject populations, acquisition devices, acquisition environments, and decoding difficulties. Figure 3 summarizes the results from Figure 2, Tables 4 and 5, indicating that the performance of the same decoding algorithm varies significantly across different datasets, and not all calculated results meet the performance requirements for BCI systems in actual scenarios.
Therefore, if only one or two datasets are used during decoding algorithm research, the algorithm's performance may fluctuate significantly in actual application scenarios. It is recommended to test and evaluate decoding algorithms on multiple datasets to ensure their reliability and effectiveness in various real-world scenarios.
Further analysis of Tables 4 and 5, taking the most widely used dataset (dataset1) as an example, shows that eTRCA achieves an identification accuracy of 94.3 ± 10.9% at a 1 s window length, while TDCA achieves 96.5 ± 6.1% and FBCCA achieves 90.2 ± 13.3% at a 2 s window length. Without further validation on other datasets, these results would suggest that FBCCA, eTRCA, and TDCA can meet the needs of actual BCI use. In reality, however, the performance of these algorithms can vary significantly when the acquisition environment, participant group, and electrode materials change. For instance, on dataset2, which simulates real application environments with a more open acquisition setting and a wider range of participants, eTRCA achieves only 75.5 ± 22.0% accuracy at a 1 s window length, TDCA achieves 82.5 ± 16.6% at a 1 s window length, and FBCCA achieves 81.8 ± 15.7% at a 2 s window length. Moreover, on the dry-electrode data in dataset3a, the accuracies decrease to 58.1 ± 31.0% (eTRCA, 1 s window), 79.4 ± 21.7% (TDCA, 1 s window), and 59.3 ± 27.3% (FBCCA, 2 s window). These accuracies fall short of the practical requirements of BCI and differ markedly from the high-performance results obtained under strictly controlled conditions (dataset1). Thus, it is highly recommended to test the performance of SSVEP-BCI decoding algorithms on multiple datasets with a graded distribution of decoding difficulty.
Furthermore, the dataset system constructed in this study can be expanded in the future to include more high-value datasets that reflect realistic BCI scenarios, as shown in Figure 3. This will enhance the accuracy and applicability of decoding algorithms in practical BCI systems.

Value of Dataset Decoding Difficulty Assessment
This study evaluated six datasets of the frequency and phase modulation SSVEP paradigm (as shown in Table 3). The six datasets covered decoding difficulty ratings of A, B, C, D, and E, and the rating results corresponded to the results in Figure 2 and Table 4, indicating that the performance indexes calculated by different decoding algorithms were relatively lower for datasets with higher decoding difficulty. Therefore, the dataset decoding difficulty assessment method can be used to construct a dataset system with differentiated decoding difficulty levels to achieve comprehensive testing of decoding algorithm performance. Furthermore, it can provide a reference for algorithm performance testing and improvement. For example, although CCA, FBCCA, eTRCA, and TDCA performed poorly on dataset3a, the decoding difficulty rating of E for this dataset confirms the rationality of the results and provides ideas for subsequent improvements to the decoding algorithm (i.e., studying the signal characteristics of dataset3a and focusing on improving algorithm performance based on dataset3a).

Current Situation of Dataset Usage in SSVEP Decoding Algorithm Research
This study further analyzed the current state of dataset usage in recent research papers on SSVEP decoding algorithms. Papers were selected according to specific criteria: the Web of Science Core Collection was searched with the topic keywords "SSVEP & algorithm", the document type was limited to papers, and the results were restricted to papers published before April 2023. In addition, we manually excluded studies not related to decoding algorithms, such as research on BCI system applications that merely involved algorithm validation and usage, algorithm review papers, and dataset papers. We then sorted the search results chronologically, selected the 20 most recently published papers, and analyzed their dataset usage, as presented in Table 7. Table 7 shows that the majority of the 20 SSVEP decoding algorithm papers used public datasets, with only three papers [18,20,25] relying on self-collected data for verification. Among the 20 papers, 40% (8/20) used only one dataset, 80% (16/20) used no more than two datasets, and the remaining four papers used only three datasets for verification. There is therefore significant room for expanding dataset usage in current research on SSVEP decoding algorithms.
Regarding which datasets were used, among the 20 papers in Table 7, SSVEP Benchmark [8] (dataset1) was used 11 times, SSVEP BETA [9] (dataset2) 5 times, SSVEP UCSD [12] (dataset6) 4 times, and SSVEP Wearable [10] (dataset3), released in 2021, twice; the remaining datasets were each used no more than twice. There are thus significant differences in how frequently different datasets are used. From the perspective of the number of subjects, verifying algorithms on previously published datasets involves far more subjects than using self-collected data, so releasing SSVEP datasets can promote algorithm research. We hope that research teams with the necessary capabilities will, in addition to using existing datasets, collect their own validation data and make them public.
In summary, the dataset evaluation method and algorithm testing dataset system proposed in this study can provide additional references for future algorithm research. Future decoding algorithm research could attempt performance testing and verification on more newly shared datasets (such as dataset2, dataset3, and dataset4).

Subsequent Extensions of This Study
This paper proposed a decoding difficulty evaluation method for SSVEP datasets, in which ACC_stand, T_best, and ITR_best are all calculated with the classic training-free algorithm FBCCA. However, the actual analysis (Figure 2, Tables 4 and 5) shows that ACC_stand, T_best, and ITR_best differ across decoding algorithms. Therefore, future studies could explore a variety of decoding algorithms to evaluate the decoding difficulty of datasets more comprehensively.
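For reference, the ITR metric underlying ITR_best is conventionally the Wolpaw information transfer rate used throughout the SSVEP-BCI literature; a minimal Python sketch is given below (function and variable names are ours, and the exact accounting of selection time in this study's T_best may differ):

```python
import math

def itr_bits_per_min(n_targets, acc, t_select):
    """Wolpaw information transfer rate in bits/min.
    n_targets: number of stimulus targets; acc: classification accuracy
    in [0, 1]; t_select: time per selection in seconds (stimulation time
    plus gaze-shift interval)."""
    if acc <= 1.0 / n_targets:      # at or below chance level
        return 0.0
    if acc >= 1.0:                  # perfect accuracy: full log2(N) bits
        bits = math.log2(n_targets)
    else:
        bits = (math.log2(n_targets)
                + acc * math.log2(acc)
                + (1 - acc) * math.log2((1 - acc) / (n_targets - 1)))
    return bits * 60.0 / t_select
```

Because ITR trades accuracy against selection time, sweeping the data length and taking the maximum of this quantity is a natural way to obtain a best-case ITR for a given algorithm and dataset.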
Furthermore, this study analyzed four representative SSVEP decoding algorithms, and this set can be extended to include other algorithms in further testing. There is also room to expand the six frequency and phase modulation SSVEP datasets compiled in this study. Future work could extend the analysis to datasets from SSVEP paradigms that are not entirely identical (for example, combining the frequency and phase modulation paradigm with the dual-frequency modulation SSVEP paradigm [42]), which could promote the development of general SSVEP-BCI decoding algorithms (e.g., eTRCA can be applied to various SSVEP paradigms). Comprehensive analyses of visual BCI paradigms similar to SSVEP (such as the C-VEP paradigm [51]) could further promote research on general decoding algorithms for visual BCI. Other BCI paradigms (such as the motor imagery paradigm [52] and the P300 paradigm [41]) can likewise build corresponding dataset systems based on their own data characteristics.

Conclusions
In summary, this paper summarized six open-source datasets of the frequency and phase modulation SSVEP-BCI paradigm, proposed a dataset decoding difficulty evaluation method, and integrated the six datasets into an SSVEP algorithm performance testing dataset system. Based on this dataset system, four classic existing algorithms (CCA, FBCCA, eTRCA, and TDCA) were tested and verified, and significant differences were found in their results across datasets. This demonstrates the importance, for new SSVEP decoding algorithm research, of using an algorithm evaluation dataset system that covers different subject groups, scenarios, and collection devices. Furthermore, this work may serve as a reference for dataset evaluation and algorithm performance testing in other BCI paradigms.

Data Availability Statement:
The data that support the findings of this study are openly available. Dataset1-dataset5 can be downloaded from "bci.med.tsinghua.edu.cn" (accessed on 9 January 2023). Download instructions for dataset6 can be found in the article by Nakanishi et al. [12].