Leveraging Multiple Distinct EEG Training Sessions for Improvement of Spectral-Based Biometric Verification Results

Most studies on EEG-based biometric recognition report results based on signal databases with a limited number of recorded EEG sessions, often using the same single EEG recording for both training and testing a proposed model. However, the EEG signal is highly vulnerable to interferences, electrode placement, and temporary conditions, which can lead to overestimated assessments of the considered methods. Our study examined how the number of distinct recording sessions used for training affects EEG-based verification. We analyzed original data from 29 participants with 20 distinct recorded sessions each, as well as from 23 additional impostors with only one session each. We used the raw coefficients of the power spectral density estimate, and the same coefficients converted to the decibel scale, as the input to a shallow neural network. Our study showed that the variance introduced by multiple recording sessions affects sensitivity. We also showed that increasing the number of training sessions above eight did not improve the results under our conditions. For 15 training sessions, the achieved accuracy was 96.7 ± 4.2%, and for eight training sessions and 12 test sessions, it was 94.9 ± 4.6%. For 15 training sessions, the rate of successful impostor attacks over all attack attempts was 3.1 ± 2.2%, but this rate was not significantly different from that obtained with six training sessions. Our findings indicate the need to include data from multiple recording sessions when training EEG-based recognition models, and show that increasing the number of test sessions did not significantly affect the obtained results. Although the presented results are for the resting state, they may serve as a baseline for other paradigms.


Introduction
Hans Berger was the first to record and publish a paper on electroencephalography (EEG) signals [1]. Since then, the interest in using bioelectrical signals from the head surface has grown in popularity. Over the years, technical achievements and improvements in the quality of EEG-collecting devices and signal analysis methods have enabled the use of EEG in many fields.
One of the advantages of EEG signals is that they can be acquired non-invasively by placing electrodes on the surface of the subject's head, which allows the determination of the distribution of bioelectrical activity in the brain. The resulting data are expressed as a function of voltage over time. In addition, this method is relatively inexpensive and has the highest temporal resolution compared to other techniques for examining brain function. It is widely used in various medical areas, as well as in EEG-neurofeedback training [2], emotion recognition [3], and neuroscience research. It allows for the acquisition of information on cognitive functions [4], sleep patterns [5], neurological disorders like epilepsy [6,7], or the depth of anesthesia [8]. Machine learning (ML) and neural network (NN) methods are often used to analyze EEG signals. They have been successfully applied to identify patterns of attention deficit hyperactivity disorder (ADHD) [9] and mental states [10,11]. In one study [12], the data from overlapping sub-bands in the alpha band (7-10 Hz, 8-11 Hz, 9-12 Hz) from REC (resting state with eyes closed) signals were analyzed using the FFT (Fast Fourier Transform) and a neural network classifier. The authors analyzed two test cases: one for verification and the second for identification. In the first one, they verified the identity of four individuals and obtained Correct Classification scores ranging from 80% to 100%, depending on the analyzed sub-band and individual.
In 2010, researchers [16] used features extracted with wavelet packet decomposition and a neural network to achieve a Correct Recognition Rate (CRR) of 81% based on four EEG channels and 71% for two EEG channels. The studies were conducted on a group of 10 participants, with five separate sessions conducted over two weeks. The results were presented for REO (resting state with eyes open) and REC signals.
An attempt was made to evaluate the impact of the number of distinct recording sessions on the recognition results [31]. The authors analyzed the permanence of a database of 45 healthy adults with five to six distinct recording sessions each, collected over three years. They analyzed the Event-Related Potential (ERP) EEG and used the Hidden Markov Models as classifiers. They pointed out that using only one recording session to perform EEG-based recognition may lead to overestimating the results.
Most studies seem to focus more on analyzing inter-individual differences than on their permanence [21]. Some studies have attempted to address this issue by increasing the number of distinct recording sessions and analyzed data from five to six sessions [16,31,32]. However, many studies develop complex algorithms based on data from one recording session [15,22,24,33-37], on data from more sessions performed on the same day [38], or on a limited number of people with multiple distinct sessions [12-14]. Therefore, there is a strong need for more comprehensive research on the impact of the number of distinct recording sessions on the permanence of EEG-based recognition [23]. The subject is still of interest, as evidenced by recently published articles on the use of EEG in biometrics [19,25,31,32,39,40].
Studies reported that the popular methods for extraction of the EEG features include Time-Series Models such as Autoregressive Models, Connectivity-Based Methods, and Frequency-Based Methods, mostly based on Power Spectral Density (PSD) or Wavelet Transform [23,40]. Moreover, others [23] described methods based on Cosine Transform and Mel-Frequency Cepstrum, and other methods based on supervised learning, such as Linear Discriminant Analysis (LDA), Common Spatial Pattern (CSP), and NN. They also described unsupervised learning methods such as Statistical Methods (Principal Component Analysis, Independent Component Analysis), Clustering-Based Methods, Restricted Boltzmann Machines, Autoencoders, and Generative Adversarial Networks. In these papers [23,40], the classification methods were reported, such as k-Nearest Neighbors, Neural Networks, Kernel Methods (e.g., Support Vector Machine), and Deep Learning.
Our study aimed to evaluate how the number of distinct EEG recording sessions used for training affects the verification results. Taking into account the results of our previous studies [41] and studies performed by other authors [12,35,36], we decided to focus on the PSD coefficients. Because we tried to simulate actual verification conditions, we did not perform any artifact rejection.

Data Collection
In the following, we analyzed 580 recordings originating from 29 adult participants (14 females and 15 males, with a mean age of 28.17 years, ranging from 23 to 44 years). For each participant, 20 EEG recording sessions were conducted over an average of approximately 70 days (ranging from 43 to 129 days). Additionally, we simulated an external attack by impostors using recordings from an additional 23 subjects who were not included in the previous analyses. In total, 603 examinations were used in the study.
The 19 electrodes were placed according to the 10-20 international electrode placement system [42]. The impedance was kept under 10 kΩ. The examinations were performed with a reference A2 electrode placed on the right ear. The sampling frequency was 500 Hz. All signals were acquired using the ExG-32 headbox manufactured by ELMIKO BIOSIGNALS (Milanówek, Poland). The local Bioethical Commission of the Copernicus University in Toruń approved the experiments. Participants had normal or corrected-to-normal vision, and were right-handed.
The study was performed using the resting-state paradigm with eyes open, as the EEG signal is heritable [43,44], which suggests that the resting-state EEG contains persistent individual features that can be used in biometric recognition. This paradigm has always accompanied the development of neuroscience. Since the seminal discovery of fMRI resting-state connectivity [45], the resting state in fMRI and EEG/MEG studies has been at the heart of human connectome analyses, as it reflects the relatively natural state of mind. Compared to other paradigms, it is practical in real-world applications. During acquisition, it does not require additional software, and the participant does not need to perform intentional actions, which can be exhausting if carried out regularly.

Preprocessing and Features Extraction
For each examination, we analyzed 3 min of data after the first 30 s. The data were filtered in the frequency range of 0.2-70 Hz, and the Notch filter (50 Hz) was applied. All analyses were performed in MATLAB 2020b.
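The filtering steps above can be sketched in Python (the paper's analyses were done in MATLAB 2020b); the filter orders and notch quality factor below are assumptions, as the paper does not state them:

```python
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt, sosfiltfilt

FS = 500  # sampling frequency [Hz]

def filter_eeg(eeg):
    """Band-pass 0.2-70 Hz plus a 50 Hz notch, applied per channel.

    eeg: array of shape (n_channels, n_samples).
    Zero-phase (forward-backward) filtering is used; the 4th-order
    Butterworth band-pass and Q=30 notch are illustrative choices.
    """
    sos = butter(4, [0.2, 70], btype="bandpass", fs=FS, output="sos")
    banded = sosfiltfilt(sos, eeg, axis=-1)
    b, a = iirnotch(w0=50, Q=30, fs=FS)  # 50 Hz mains notch
    return filtfilt(b, a, banded, axis=-1)
```

A quick check is to pass a 10 Hz plus 50 Hz mixture through the chain: the 10 Hz component should survive while the 50 Hz component is suppressed.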
To improve the signal-to-noise ratio (SNR), we re-referenced the channels to the Common Average Reference (CAR) composed of all 19 electrodes [46]:
x_ch^CAR(i) = x_ch(i) - (1/19) * Σ_{c=1}^{19} x_c(i), (1)
where x is the analyzed data vector, i is the sample number, and ch is the channel number. Continuing our previous studies [41], every examination was divided into 7.5 s segments. Welch's power spectral density was estimated for each frequency in the 1-45 Hz range. We used a 1 s sliding Hamming window with 0.5 s overlap. The analyses were performed on both the raw PSD estimate coefficients and the coefficients converted to the decibel scale:
P_dB(i, ch) = 10 * log10(P(i, ch)), (2)
where P is the PSD estimate coefficients, i is the frequency bin, and ch is the channel number. The described preprocessing procedure is presented in Figure 1.
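A minimal Python sketch of this feature-extraction chain (CAR re-referencing, Welch PSD with a 1 s Hamming window and 0.5 s overlap, 1-45 Hz bins, decibel conversion); the exact MATLAB settings may differ in detail:

```python
import numpy as np
from scipy.signal import welch

FS = 500                    # sampling frequency [Hz]
SEG_LEN = int(7.5 * FS)     # 7.5 s segments -> 3750 samples

def extract_features(segment):
    """PSD features for one 7.5 s segment of shape (19, 3750).

    Returns both the raw PSD coefficients (approach I) and the
    decibel-scaled coefficients (approach II), each of shape (19, 45).
    """
    # Common Average Reference: subtract the cross-channel mean per sample
    car = segment - segment.mean(axis=0, keepdims=True)
    # Welch PSD: 1 s Hamming window, 0.5 s overlap -> 1 Hz resolution
    freqs, psd = welch(car, fs=FS, window="hamming",
                       nperseg=FS, noverlap=FS // 2, axis=-1)
    band = (freqs >= 1) & (freqs <= 45)
    psd = psd[:, band]                 # 45 frequency bins per channel
    psd_db = 10 * np.log10(psd)       # decibel scale (approach II)
    return psd, psd_db
```

With these settings each segment yields 19 channels × 45 bins = 855 spectral coefficients.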

Data Set Creation and Results Evaluation
Following the results of our previous studies [41], we decided to use a feed-forward artificial neural network [47] with one hidden layer and two output neurons. In the output layer, we used a linear transfer function, and in the hidden layer, a hyperbolic tangent sigmoid activation function. Our previous study [41] indicated that a single neuron in the hidden layer should be sufficient for verification, which was also confirmed in the current study. We performed a two-class classification [18]. A dedicated neural network, trained with the Levenberg-Marquardt backpropagation algorithm [48], was created for each participant.
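The architecture can be illustrated with a small forward-pass sketch; the weights below are random placeholders, and the Levenberg-Marquardt training used in the paper (MATLAB's trainlm) is not reproduced here:

```python
import numpy as np

class VerifierNet:
    """Feed-forward network as described: one hidden layer with a single
    tanh (sigmoidal) neuron and two linear output neurons. One such
    network is created per participant; training is done elsewhere."""

    def __init__(self, n_inputs, n_hidden=1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((n_hidden, n_inputs)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.standard_normal((2, n_hidden)) * 0.1
        self.b2 = np.zeros(2)

    def forward(self, x):
        h = np.tanh(self.W1 @ x + self.b1)   # sigmoidal hidden layer
        return self.W2 @ h + self.b2         # linear output layer

    def decide(self, x):
        # output index 0 = claimant, 1 = impostor (labeling is an assumption)
        return int(np.argmax(self.forward(x)))
```

The input size would be the flattened spectral feature vector, e.g. 19 channels × 45 bins = 855 values per 7.5 s segment.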

Since this is an analysis of verification for each person, separate training and testing data sets were created. The core part of the analyses always included the last five recordings as the test set. To stratify the data set, an equal number of impostor segments was randomly drawn from the impostor data set.
By increasing the number of recordings used for training from one to 15, we analyzed their impact on the test results. The data-division diagram is presented in Figure 2. Within the training data set, we performed 8-fold cross-validation for each number of recordings. Each time, 10 neural networks were created, from which the best one was selected based on the result on the validation data. Since the genuine and impostor sets are unequal, we performed stratified cross-validation to keep the results unbiased. The final test results were averaged.
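The selection protocol (stratified 8-fold cross-validation, best of 10 candidate networks per fold) can be sketched as follows; scikit-learn's MLPClassifier stands in for the MATLAB network and uses a different optimizer, so this illustrates the protocol rather than the exact model:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

def best_of_ten_per_fold(X, y, seed=0):
    """Stratified 8-fold CV; in each fold, fit 10 candidate networks
    and keep the one with the best validation accuracy."""
    skf = StratifiedKFold(n_splits=8, shuffle=True, random_state=seed)
    fold_models, fold_scores = [], []
    for train_idx, val_idx in skf.split(X, y):
        candidates = [
            MLPClassifier(hidden_layer_sizes=(1,), activation="tanh",
                          max_iter=500, random_state=seed + i)
            .fit(X[train_idx], y[train_idx])
            for i in range(10)
        ]
        scores = [m.score(X[val_idx], y[val_idx]) for m in candidates]
        best = int(np.argmax(scores))
        fold_models.append(candidates[best])
        fold_scores.append(scores[best])
    return fold_models, fold_scores
```

Stratification keeps the genuine/impostor ratio constant across folds, which is what makes the per-fold validation scores comparable.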
The results of the EEG-based verification were assessed using the following classification measures: accuracy (ACC), sensitivity (SEN), specificity (SPEC), and precision (PREC):
ACC = (TP + TN) / (TP + TN + FP + FN), (3)
SEN = TP / (TP + FN), (4)
SPEC = TN / (TN + FP), (5)
PREC = TP / (TP + FP), (6)
where TP (True Positive) is the number of segments correctly recognized as a claimant (the person whom the neural network had learned to recognize); TN (True Negative) is the number of segments correctly rejected as an impostor (an unauthorized person); FP (False Positive) is the number of segments in which an impostor was improperly recognized as a claimant; and FN (False Negative) is the number of segments in which a claimant was incorrectly rejected. In this paper, sensitivity is considered the ability of the algorithm to correctly verify the person as the claimant; specificity, the ability to properly recognize the impostor; accuracy, the overall factor of correctly recognized segments; and precision, the fraction of correctly recognized claimant segments among all segments recognized as the claimant.
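Under these definitions, the four measures can be computed directly from the segment counts:

```python
def verification_measures(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and precision (Eqs. (3)-(6))
    computed from segment-level counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall correct decisions
    sen = tp / (tp + fn)     # correctly verified claimant segments
    spec = tn / (tn + fp)    # correctly rejected impostor segments
    prec = tp / (tp + fp)    # claimant hits among all claimant decisions
    return acc, sen, spec, prec
```

For example, with 90 accepted and 10 rejected claimant segments against 95 rejected and 5 accepted impostor segments, accuracy is 92.5%, sensitivity 90%, and specificity 95%.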

Results of the Experiment
Studies on EEG-based human recognition often use databases containing a limited number of distinct, separated-in-time sessions. However, models based on one recording session cannot be properly evaluated. EEG signals are vulnerable to interferences, and when training and testing are based on only one recording session, there is a possibility that the signals are heavily noised or distorted by environmental or technical interferences. Furthermore, in extreme cases, the learning mechanism may adapt to these interferences instead of EEG patterns. Moreover, it can adapt to spontaneous brain activity. Thus, among other aspects, we decided to analyze the influence of the number of recording sessions used for training on the performance of EEG-based recognition. In our experiments, all sessions for each participant are derived from 20 distinct sessions separated in time. The crucial aspect of this evaluation is that the test data sets are completely separated in time from the training ones. This approach should exclude the impact of additional unpredictable variables such as temporal daily EEG changes or technical aspects and interferences. In real-life applications of such a system, it is crucial to properly verify the distinct signals based on the smallest possible number of training sessions. We verified our hypothesis by analyzing the impact of an increasing number of sessions used for training on the accuracy results of the last five sessions of each participant.

Training Sessions: 1-15, Testing Sessions: 16-20
In the first series of experiments, we studied the impact of the number of sessions in the training data set on the classification measures, which reflect the performance of the model. To determine the number of sessions after which there is no significant improvement, we varied the number of recording sessions in the training data set from one to 15. Meanwhile, we evaluated how different methods of normalization of the PSD estimate coefficients would affect the results. Although we analyzed many methods of normalizing the PSD, the best results were obtained by converting it to the decibel scale. To our knowledge, this comparison has not been reported before in EEG-based recognition. The statistical classification measures (3)-(6) were tested each time on the last five recordings. In the first approach (I), we analyzed the raw PSD estimate coefficients. In the second approach (II), the PSD estimate coefficients were converted to the decibel scale. Figure 3 shows the averaged results (over all participants) for both approaches. The accuracy of the model is presented with the blue line, sensitivity with the red line, specificity with the yellow line, and precision with the purple line. It should be noted that, to highlight the differences between the approaches, the y-axis was limited to 65-100%. The standard deviations of the analyzed statistical measures are presented in Figure 4. The y-axis in this figure was limited to 0-25%. The results presented in Figures 3 and 4 indicate that, as the number of training sessions increases, the classification measures tend to improve for both approaches. The performance measures increase, and the standard deviations decrease significantly. This is particularly noticeable for a small number of sessions (one to eight). One of the most important conclusions is that the PSD estimate coefficients provide better results for all considered classification measures when converted to the decibel scale.
Moreover, the results for approach II tend to stabilize faster and show less fluctuation between the obtained results in the last sessions. The results from Figures 3 and 4 are also presented in Tables 1 and 2 to enable a more detailed comparison. For the clarity of the paper, we decided to consider only approach II going forward. This choice is justified because the classification measures obtained for the PSD estimate coefficients converted to the decibel scale provided significantly better results (for the accuracy, the p-value obtained with the Wilcoxon Rank Sum Test is equal to 0.00015).
To reject the hypothesis that the observed improvement of the classification measures with the increasing number of sessions results only from an increase in the size of the training data sets, we analyzed how data variety influences a training set the size of a single session. For each person, we created a training data set with a size corresponding to a single training session (24 segments) by adding one to two segments from each of the 15 training examinations, and analyzed the results for five test sessions. An equal number of impostor segments was randomly drawn from the impostor data sets for the training and testing sets. The obtained accuracy was 88.2 ± 6.8%; sensitivity was 90.1 ± 10.1%; specificity was 86.3 ± 6.4%; and precision was 87.4 ± 5.7%.
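The construction of this size-matched but session-diverse training set can be sketched as follows; the helper name and the exact per-session balancing are our assumptions:

```python
import numpy as np

def diverse_single_session_sized_set(sessions, n_total=24, seed=0):
    """Build a genuine training set the size of one session (24 segments)
    by drawing one to two segments from each of the 15 training sessions.

    sessions: list of arrays, each of shape (segments_per_session, n_features).
    """
    rng = np.random.default_rng(seed)
    n_sessions = len(sessions)
    # start with one segment per session, then top up to n_total
    counts = np.ones(n_sessions, dtype=int)
    extra = rng.choice(n_sessions, size=n_total - n_sessions, replace=False)
    counts[extra] += 1
    picked = [sess[rng.choice(len(sess), size=c, replace=False)]
              for sess, c in zip(sessions, counts)]
    return np.concatenate(picked, axis=0)
```

With 15 sessions and 24 target segments, every session contributes at least one segment and nine sessions contribute two, so the set matches a single session in size while spanning all sessions in diversity.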

To compare the results for different numbers of training sessions, we first verified whether the obtained data came from the standard normal distribution by performing the one-sample Kolmogorov-Smirnov test. The test sample was the vector of averaged accuracy obtained for each participant. The p-values obtained are presented in Table 3.
Table 3. Results of the Kolmogorov-Smirnov test. The p-values lower than 0.05 (which indicate that the data do not come from the standard normal distribution) are marked in red.

Next, we analyzed how the number of sessions used in the training data set affected the obtained results. Since we did not have sufficient evidence that the data from all observations follow the normal distribution, we performed the Wilcoxon Rank Sum Test to compare the accuracy values for the different numbers of sessions used to create the training data set. The results are presented in Table 4. The multiple comparisons of accuracy showed no evidence of a significant difference between eight or more sessions used for training, which suggests that there is no need to increase the number of sessions in the training data set beyond eight.
Table 4. Results of the Wilcoxon Rank Sum Test. The p-values lower than 0.05 (which indicate significant differences between groups) are marked in red.
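The statistical protocol can be sketched with SciPy; note that MATLAB's kstest compares against the standard normal distribution, so the accuracy vectors are z-scored first (an assumption about the authors' exact procedure):

```python
import numpy as np
from scipy import stats

def normality_and_difference(acc_a, acc_b, alpha=0.05):
    """One-sample Kolmogorov-Smirnov test of each per-participant
    accuracy vector against the standard normal distribution (after
    z-scoring), followed by a Wilcoxon Rank Sum Test between groups."""
    def ks_p(v):
        v = np.asarray(v, dtype=float)
        z = (v - v.mean()) / v.std(ddof=1)   # standardize before KS
        return stats.kstest(z, "norm").pvalue
    p_ks = (ks_p(acc_a), ks_p(acc_b))
    p_rank = stats.ranksums(acc_a, acc_b).pvalue
    return p_ks, p_rank, p_rank < alpha
```

Testing standardized data against the standard normal is only approximate (the Lilliefors correction would be stricter), but it mirrors the common MATLAB workflow.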

Since we could not demonstrate differences between using eight sessions or more, we decided to study the behavior of this number of sessions in the case of an increased number of examinations in the testing data sets. Therefore, we changed the number of examinations included in the training data set from one to eight and used the data from the last 12 recording sessions as the testing data set. The obtained averaged results of the classification measures and standard deviations are presented in Figure 6. The corresponding results are also presented in Table 5 in numerical form.
To extend our analysis, we investigated whether an even greater increase in the testing data set would affect the analyzed classification measures. For this reason, we also evaluated the model's performance using one training session and 19 testing sessions. The achieved accuracy was 81.1 ± 8.3%; sensitivity was 74.7 ± 15.5%; specificity was 87.4 ± 5.6%; and precision was 85.8 ± 6.8%. There was no evidence for a significant difference between the accuracy obtained for one training session and 19 testing sessions and the accuracy obtained for one training session and five testing sessions (the p-value obtained with the Wilcoxon Rank Sum Test is equal to 0.62; see Table 2 for details).

Simulated Impostor Attack
In the final part of our analysis, we tested the robustness of the proposed EEG-based verification approach by simulating an attack by external impostors. For this, we used recordings of 23 additional subjects who were not involved in training the networks in any way. Our goal was to confirm that these external subjects would be recognized as impostors and, thus, would not introduce any additional external impact on the results.
We conducted this analysis using five examinations for testing and varying the number of training sessions from one to 15. In total, we extracted 552 attack attempts (23 impostors, each providing 24 attempts from 7.5 s segments) from the additional 23 recordings, which were introduced to the dedicated neural network of each participant (29 in total). This resulted in 128,064 attack attempts (552 attack attempts for each of the 29 participants, with eight networks resulting from cross-validation). The results for the different numbers of sessions used for training are presented in Table 6 and in Figure 7. Once again, we used the Wilcoxon Rank Sum Test to compare the average number of non-rejected impostors between different numbers of training sessions. The results are presented in Table 7. The multiple comparisons showed no evidence of a difference between six or more sessions used for training; the only exception was between eight sessions and 15 sessions.
Table 7. Results of the Wilcoxon Rank Sum Test for the simulation of the impostor attack. The p-values lower than 0.05 (which indicate significant differences between groups) are marked in red.
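The attack-attempt accounting in this paragraph can be verified in a few lines:

```python
# Accounting for the simulated impostor attack described above
n_impostors = 23
segments_per_recording = 24   # 3 min of data / 7.5 s segments
n_participants = 29
n_networks = 8                # one network per cross-validation fold

attack_attempts = n_impostors * segments_per_recording   # 552
total_attempts = attack_attempts * n_participants * n_networks

print(attack_attempts, total_attempts)  # 552 128064
```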


Discussion
This paper focused on analyzing the impact of the number of distinct EEG recording sessions used for training on the results of a spectral-based method for verification of a claimed identity. For 29 adult participants, we recorded 20 examinations each, every one performed on a different day. In our studies, we also used 23 single recordings from an additional 23 subjects. In total, we used 603 examinations, which exceeds the database size of other similar studies. Initially, we analyzed the results for two approaches: in the first, we evaluated features based on the raw PSD estimate coefficients, and in the second, the PSD estimate coefficients were converted to the decibel scale. The assessment was based on the standard classification measures (accuracy, specificity, sensitivity, and precision) described in Section 2.3. The results showed that converting the PSD estimate coefficients to the decibel scale improved the results for all measures regardless of the number of examinations used for training. The average results increased and their standard deviations decreased; for example, for 15 training sessions, accuracy increased from 92.6 ± 5.6% to 96.7 ± 4.2%, sensitivity increased from 92.2 ± 9.4% to 94.8 ± 8.0%, specificity increased from 93.0 ± 3.5% to 98.5 ± 1.1%, and precision increased from 92.9 ± 3.7% to 98.5 ± 1.1% (see Tables 1 and 2 for comparison).
Observing the results obtained for the classification measures in Figures 3 and 4, one can see that they improve with the number of examinations used for training. However, after a certain number of recordings used for training, this improvement became statistically insignificant. Our analysis showed that, for five examinations used for testing, there is no statistical difference between eight and 15 recordings used for training. For eight training and five testing examinations, the following results were obtained: accuracy: 95.3 ± 4.7%; sensitivity: 93.2 ± 8.6%; specificity: 97.5 ± 1.7%; and precision: 97.4 ± 1.8%. For comparison, the results for 15 training and five testing recordings were as follows: accuracy: 96.7 ± 4.2%; sensitivity: 94.8 ± 8.0%; specificity: 98.5 ± 1.1%; and precision: 98.5 ± 1.1%. Although the results for more sessions seem to be better in all cases, these differences were not statistically significant.
To the best of our knowledge, this kind of analysis has not been attempted previously. Although the presented results are for the resting state, they may serve as a baseline for other paradigms. The problem of using data sets with an insufficient number of distinct recordings is still a concern [17,19,31,32,39,40,49-51]. Using data from the same recording session to train and test the model may mean that changes in the EEG and its spectral characteristics are not captured. We clearly showed that the EEG changes over time. If more than one recording session is used for training, the results are better because the model learns the nonstationary behavior of the characteristics.
Particular attention should be paid to sensitivity, which consistently achieved the worst results among all classification performance measures. At the same time, after a relatively small number of sessions, specificity and precision reached saturation levels, after which there was no significant improvement. In addition, sensitivity showed a higher standard deviation than the other measures. This may suggest intra-individual changes between sessions that are so large that the system more often falsely rejects the segments of genuine claimants. This result is not surprising, since the EEG tends to change slightly on a daily basis, overlaps with muscle activity [52], and may be affected by the emotional state [3]. From the biometrics perspective, this behavior seems appropriate: when someone is trying to gain access to a secure place, it is better for the system to reject a genuine claimant than to accept an unauthorized person.
Nevertheless, this suggests that more studies are needed to properly evaluate the performance of EEG-based biometric systems; otherwise, there is a danger that such a system will fail to correctly recognize genuine users or to reject impostors. Even so, the achieved results are promising despite using simple features based on the PSD estimate coefficients.
We also verified whether the observed improvement in model performance with an increasing number of sessions was due to an increase in the number of segments in the training data set or to an increase in their diversity (see Section 3.1. for details). The obtained outcomes (ACC: 88.2 ± 6.8%, SEN: 90.1 ± 10.1%, SPEC: 86.3 ± 6.4%, PREC: 87.4 ± 5.7%) were lower than those obtained for all segments from 15 training sessions (ACC: 96.7 ± 4.2%, SEN: 94.8 ± 8.0%, SPEC: 98.5 ± 1.1%, PREC: 98.5 ± 1.1%). However, they were higher than those obtained using data from only one training session (ACC: 79.1 ± 10.7%, SEN: 71.4 ± 20.6%, SPEC: 86.8 ± 7.4%, PREC: 83.1 ± 10.6%). The difference is particularly noticeable in sensitivity, which was only 71.4 ± 20.6% when a single training session was used, without information from the others.
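The control experiment described above can be sketched as balanced subsampling: the training set is reduced to roughly a single session's worth of segments while still drawing from all sessions, so that segment count is held constant and only between-session diversity remains (a minimal illustration; the session and segment counts are assumptions, and `subsample_across_sessions` is a hypothetical helper, not our actual code):

```python
import random

def subsample_across_sessions(sessions, total_segments, seed=0):
    """Draw total_segments // len(sessions) segments from each session,
    so the training set matches a single session's size while retaining
    between-session diversity (hypothetical helper)."""
    rng = random.Random(seed)
    per_session = total_segments // len(sessions)
    subset = []
    for segments in sessions:
        subset.extend(rng.sample(segments, min(per_session, len(segments))))
    return subset

# Assumed setup: 15 sessions of 30 segments each, reduced to one
# session's worth (30 segments) spread evenly across all 15 sessions
sessions = [[f"s{i}_seg{j}" for j in range(30)] for i in range(15)]
subset = subsample_across_sessions(sessions, total_segments=30)
```

A model trained on such a subset sees the between-session variance but not the full amount of data, which matches the intermediate results reported above.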
Since eight recording sessions for training seemed sufficient to carry out the verification, we studied how the outcomes would be affected by enlarging the test set. For this reason, the last 12 recordings were included in the testing data set, which yielded the following results: accuracy: 94.9 ± 4.6%; sensitivity: 92.4 ± 8.5%; specificity: 97.3 ± 1.8%; and precision: 97.2 ± 1.9%. These outcomes are almost identical to those obtained using eight training recordings and a smaller testing set (see Table 2 for details). Following the same logic, we also studied the behavior of the system with only one training examination and 19 test recordings. For the obtained results (accuracy: 81.1 ± 8.3%; sensitivity: 74.7 ± 15.5%; specificity: 87.4 ± 5.6%; precision: 85.8 ± 6.8%), we did not find evidence of a significant difference. These results suggest that five sessions may be sufficient for testing.
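The chronological, session-wise splits used in these comparisons (e.g., eight training and 12 test sessions out of 20) can be illustrated as follows (a sketch; `session_split` is a hypothetical helper, not our actual code):

```python
def session_split(session_ids, n_train):
    """Chronological session-wise split: the first n_train sessions are
    used for training and the remaining ones for testing, so no single
    recording contributes segments to both sets."""
    return session_ids[:n_train], session_ids[n_train:]

# 20 recorded sessions per participant: 8 for training, 12 for testing
train_sessions, test_sessions = session_split(list(range(1, 21)), n_train=8)
```

Splitting at the session level, rather than at the segment level, is what prevents the data leakage discussed in the Conclusions.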
The results obtained in the simulated impostor attack correspond to the specificity obtained for five testing examinations and a number of training sessions varying from one to 15. As the number of training sessions increases, specificity increases. For 15 training sessions, the specificity was 98.5 ± 1.1%, while the percentage of successful impostor attacks over all attack attempts was 3.1 ± 2.2%. The multiple comparisons of accuracy showed that six examinations used for training were not significantly different from 15 (see Table 7 for details). This number is lower than that calculated based on the classification measures, but this should not be surprising: the FAR is associated with specificity, which stabilized quite quickly in our studies, while the deficiencies in stability primarily concerned sensitivity. However, the FAR coefficient still needs to be reduced. This may be achieved by combining this method with another one, e.g., entropy or connectivity measures [23,40]. In this paper, we assessed the results of person verification obtained from the analysis of 7.5 s EEG segments, averaged over all participants. Increasing the length of the analyzed segments, as well as the number of segments used for verification, should decrease the FAR. The results may also benefit from analyzing the impact of the EEG sub-bands and limiting the method to the sub-bands with the greatest impact on the verification results, which would allow for further noise reduction. It should be noted that the results achieved for EEG-based biometrics are still slightly worse than those of other biometric methods developed over many decades, but research on EEG biometrics is still in its early stages [53].
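The attack-success rate reported above is simply the empirical false acceptance rate (a minimal sketch; the counts in the example are illustrative, chosen only to reproduce a 3.1% rate, and do not come from our experiments):

```python
def false_acceptance_rate(successful_attacks, attack_attempts):
    """FAR: fraction of impostor attack attempts that were wrongly accepted."""
    return successful_attacks / attack_attempts

# Illustrative counts yielding a 3.1% rate
far = false_acceptance_rate(successful_attacks=31, attack_attempts=1000)
```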

Conclusions
One of the challenges in biometrics based on EEG signal analysis is to train mechanisms for automatic person verification in the shortest possible time while ensuring sufficient verification results. In most of the results published in the literature, with only a few exceptions, the training and testing data sets come from the same EEG recordings, which has several drawbacks:

1. Using the same recordings for training and testing may cause data leakage, resulting in falsely high classification accuracy.
2. One session may not reflect the technical and environmental conditions encountered in various practical situations, as the practical application of the considered EEG-based method involves multiple sessions that bring additional variance due to differences in electrode montage, impedance, and the mental state of a subject. None of these sources of variance is accessible to machine learning procedures trained on data from a single session.
3. In practical applications, some electrodes may not make proper contact, and not always in the same way.
This study investigated how different numbers of distinct recording sessions affect EEG-based verification. One of the main conclusions is that increasing the number of test sessions did not significantly affect the obtained results; it turned out to be essential to use more training sessions. This result was driven by verification sensitivity, which appears to be less stable than specificity. Our analysis showed that the results improved as the number of training sessions increased; however, more than eight distinct recording sessions did not make a significant difference for the proposed approach. In future studies, we plan to develop a more advanced method for EEG-based verification and to reduce the number of electrodes. Moreover, due to EEG variability, it is essential to perform similar studies under heterogeneous conditions.

Author Contributions: Conceptualization, R.P., K.J., U.M. and J.R.; methodology, R.P. and K.J.; software, R.P.; validation, R.P. and K.J.; formal analysis, R.P. and K.J.; investigation, R.P. and K.J.; data acquisition, U.M. and J.R.; resources, R.P., K.J. and J.R.; data curation, R.P.; writing-original draft preparation, R.P.; writing-review and editing, K.J., U.M. and J.R.; visualization, R.P.; supervision, K.J.; project administration, K.J. All authors have read and agreed to the published version of the manuscript.