Representation Learning for EEG-Based Biometrics Using Hilbert–Huang Transform

Abstract: A promising approach to overcoming the shortcomings of password systems is biometric authentication, in particular authentication based on electroencephalogram (EEG) data. In this paper, we propose a subject-independent learning method for EEG-based biometrics using Hilbert spectrograms of the data. The proposed neural network architecture treats the spectrogram as a collection of one-dimensional series and applies one-dimensional dilated convolutions over them; a multi-similarity loss is used as the loss function for subject-independent learning. The architecture was tested on the publicly available PhysioNet EEG Motor Movement/Imagery Dataset (PEEGMIMDB) and achieved a 14.63% Equal Error Rate (EER). The main advantages of the proposed approach are subject independence and suitability for interpretation via the created spectrograms and the integrated gradients method.


Introduction
Password-based authentication is being replaced by more reliable biometric authentication [1]. Biometric authentication uses a person's unique biological characteristics for recognition. Some of the most commonly used biometric traits are finger or palm prints, the iris pattern, the timbre and spectral images of the voice, facial images, handwritten signatures, and regular handwriting [2]. Several requirements must be met for biometrics to be applicable in a real-world setting. In particular, the biometric trait must be universal, persistent, and easy to measure, and identification systems based on it must perform well and recognize identity with sufficient accuracy for practical applications [3]. Most biometric authentication systems also require the user to be physically present for authorization [4]. Arguably, the most important advantage of biometric authentication is that the user experience is usually convenient and fast [5]. Modern smartphones use fingerprint and facial recognition systems, which work fairly quickly for the end user and partially sidestep the problem of forgotten passwords. Among the biometric authentication systems that have not yet become widespread, we can highlight those that rely on EEG data.
EEG-based systems currently have many advantages over traditional methods and have attracted considerable research interest [6]. At present, biometric EEG signals cannot be easily replicated and confirm that the user is alive and well, which makes them a more reliable choice for identity verification, although the possibility of EEG signals being faked or compromised still exists [7]. EEG data can be used not only for authentication, but also for other purposes (emotion recognition, sleep, and health studies). In [8], the researchers created a new automated sleep staging system based on an ensemble learning stacking model that integrates Random Forest (RF) and eXtreme Gradient Boosting (XGBoost), achieving 90.56% accuracy. In [9], EEG data from six electrodes were used to detect stroke patients with the C5.0 decision tree machine learning method, achieving 89% accuracy. In [10], a support vector machine was also used to distinguish stroke patients from healthy subjects (98% accuracy using only two electrodes versus 95.8% accuracy achieved in [11] using electrocardiogram (ECG) data and the random tree model). EEG data can also be used for the classification of Parkinson's Disease (PD), as shown in [12] (the authors used Discriminant Function Analysis (DFA) and achieved 62% accuracy on EEG data alone and 98.8% accuracy combining EEG and Electromyogram (EMG) data). The classification of patients vs. controls for the diagnosis of PD in [13] was performed using a 13-layer neural net (88.2% accuracy). The multifunctionality of EEG data can help improve the reliability of an authentication system based on EEG data. For example, EEG data can change depending on the state and emotions of the user [14], which provides some protection in case the user is forcibly scanned in a life-threatening situation.
State-of-the-art methods (a dynamical graph convolutional neural network in [15], random forest in [16], k-NN in [17]) can classify emotions using EEG data with more than 80% accuracy [18]. Many biometric traits, such as facial images, can be used for surveillance without notifying the user, but in the case of EEG data, data extraction stops when the device is removed from the head [19].
At present, there are many studies on subject recognition using EEG data and machine learning methods. The first such study was conducted by the University of Piraeus in 1999. EEG signals were collected on a single monopolar channel using a mobile EEG device and used to train a vector quantizer network. The accuracy of the trained network was 72-84% [20]. In [21], the researchers used the k-Nearest-Neighbors (k-NN) algorithm and Linear Discriminant Analysis (LDA) to classify data from twenty participants, who were asked to perform two different tasks during signal capture: a hand movement task or an imaginary hand movement task. Accuracy ranged from 94.75% to 98.03%. In [22], a four-level (two convolutional layers and two pooling layers) Convolutional Neural Network (CNN) was used. Thirty subjects were recruited for the experiment. During the first task, participants were asked to remember their faces; during the second task, participants were asked to perform 10-12 eye blinks. The accuracy of this approach was 97.6%.
EEG-based subject-dependent recognition achieved practically perfect accuracy using a single recording session (3.9% EER in [22] using CNN and eye-blinking signals coupled with EEG signals, 99.8% accuracy in [23] using LDA and k-NN). However, the systems that achieve such high accuracy are of little use in real life for two reasons:

1. Most researchers use EEG data from only one data acquisition session without considering the possibility of the signal being non-stationary;
2. These approaches work only with a fixed list of users (subject-dependent).
Some researchers have tried to study and solve the first problem described above, non-stationarity. Reference [24] collected longitudinal EEG data (over the course of a year) and found that, when only single-session data are used, the classifier may generalize over session-specific recording conditions rather than over a person's individual EEG characteristics; the authors achieved 90.8% Rank-1 identification accuracy over multiple sessions. Unfortunately, the collected dataset is not publicly available. In our work, we did not try to solve the first problem and used a dataset with only one recording session.
Regarding the second problem, subject dependency, all previous works had a fixed subject list as the output. In practice, the network should be able to recognize signals it has not encountered before in order to recognize a threat. It is possible to work around the problem by building separate classifiers for each user, but this is still impractical since training requires a fairly large amount of time. A subject-independent network has no classes at all. Instead, it takes two electroencephalogram signals, converts them into two feature vectors, and compares the distance between them to a certain threshold value. Recently, Reference [24] also considered the subject-independent classification approach, where system classification performance was tested using the leave-one-group-out methodology (the data of one of the users was not presented in the training fold and was present only in the test fold) [25]. In [26], a subject-independent classifier achieved the best validation result using the eyes-open (5.9% EER) and eyes-closed (7.2% EER) states' data (multiple sessions) and 31 s of verification phase data. Still, their architecture relied on one-dimensional convolutions performed over downsampled time series data, and the output of the system was difficult for the average person to interpret, explain, or draw conclusions about, thus creating a new problem: the interpretability of deep learning systems.
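The verification rule described above (two feature vectors, one distance, one threshold) can be sketched as follows. This is a minimal illustration, not the authors' code: the random vectors stand in for network outputs, the names `cosine_distance` and `same_subject` are hypothetical, and the threshold of 0.3 is an arbitrary illustrative value.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two feature vectors (0 means same direction)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))

def same_subject(emb_enrolled, emb_probe, threshold):
    """Accept the probe if its embedding is close enough to the enrolled one."""
    return cosine_distance(emb_enrolled, emb_probe) <= threshold

# Hypothetical 128-dimensional embeddings standing in for network outputs.
rng = np.random.default_rng(0)
enrolled = rng.normal(size=128)
genuine_probe = enrolled + 0.1 * rng.normal(size=128)  # same subject, small variation
impostor_probe = rng.normal(size=128)                  # unrelated subject

THRESHOLD = 0.3  # illustrative only; in practice tuned on validation data
accept_genuine = same_subject(enrolled, genuine_probe, THRESHOLD)
accept_impostor = same_subject(enrolled, impostor_probe, THRESHOLD)
```

Because no class list is involved, enrolling a new user only requires storing a reference embedding; the network itself does not need to be retrained.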
Which frequencies contribute the most to the system's output and distinguish one subject's data from another's? To partially solve this problem, we propose to use Hilbert spectrograms (obtained using the Hilbert-Huang transform and Empirical Mode Decomposition (EMD)) as the input and a publicly available dataset, the PhysioNet EEG Motor Movement/Imagery Dataset. Empirical mode decomposition with hand-crafted features has already been applied [27] to the PhysioNet EEG Motor Movement/Imagery Dataset (95.64% accuracy in the subject-dependent scenario, when each subject receives a separately built classifier). We also propose to apply an explainable artificial intelligence method, integrated gradients [28]. Such a method can increase user confidence in authentication system output, validate existing knowledge, question existing knowledge, and generate new assumptions [29].
In this paper, we propose a subject-independent learning method for EEG-based biometrics using Hilbert spectrograms of the data. The proposed neural network architecture treats a spectrogram as a collection of one-dimensional series and applies one-dimensional dilated convolutions over them; a multi-similarity loss is used as the loss function for subject-independent learning. The architecture was tested on the PhysioNet EEG Motor Movement/Imagery Dataset (PEEGMIMDB) [30] and achieved a 14.63% Equal Error Rate (EER). The main advantage of the proposed approach is its suitability for interpretation via Hilbert spectrograms and the integrated gradients method. The main contributions of this study are as follows:
• A subject-independent neural network architecture for EEG-based biometrics that uses Hilbert spectrograms of the data as the input (trained using the multi-similarity loss);
• The use of the integrated gradients method for interpreting the output of the proposed architecture.

Dataset
The PhysioNet EEG Motor Movement/Imagery Dataset [30], containing 1 min and 2 min recordings of 109 people, was used. Subjects performed different motor/imagery tasks (4 tasks, 2 min EEG recordings); EEG recordings were also taken in the eyes-open and eyes-closed resting states (1 min recordings).

Signal Processing
Initially, each EEG recording consisted of 64 time series (from 64 electrodes), recorded using the BCI2000 system with a 160 Hz sampling rate. The data were divided into epochs of 5 s in duration (see Figure 1). To perform this split and to process the dataset, we used the MNE Python toolkit [31]. We used data from only 8 channels (O1, O2, P3, P4, C3, C4, F3, F4) to reduce the computational complexity, as [27] showed no significant drop in classification performance when only those 8 channels were used. We also used EEG data from only the eyes-open and eyes-closed states, as these showed the best results in [26] and can be considered more practical from a consumer point of view (less time is needed to authenticate users, who are not required to perform any specific task other than remaining still and resting). After such preprocessing, we had the following dataset dimensions:
To obtain the EEG signal spectrograms, we used the Hilbert-Huang Transform (HHT). In [32], it was concluded that the HHT can help eliminate noise from the EEG signal, is well suited to processing signals such as brain electrical signals, and has excellent time-frequency resolution, which makes it suitable for analyzing non-stationary signals. In the first stage of the Hilbert-Huang transform, the signal was decomposed into empirical modes. The Hilbert transform was subsequently applied to the selected modes of the decomposition. This transform allows an effective decomposition of non-linear and non-stationary signals, which is especially useful in the case of EEG. The transform also does not require an a priori functional basis; the basis functions are set adaptively from the data by the empirical mode function selection procedure. An example of the EEG signal decomposition into empirical modes is shown in Figure 2.
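The channel selection and 5 s epoching described at the start of this subsection can be sketched in plain NumPy. This is not the MNE code used in the study. It assumes that the 801 points per epoch reported later come from keeping both endpoints of a 5 s window at 160 Hz, so consecutive epochs share one sample. The 56 placeholder channel names and the synthetic recording are hypothetical.

```python
import numpy as np

SFREQ = 160      # sampling rate in Hz, from the text
EPOCH_SEC = 5    # epoch duration in seconds, from the text
SELECTED = ["O1", "O2", "P3", "P4", "C3", "C4", "F3", "F4"]

def split_into_epochs(recording, ch_names, picks=SELECTED,
                      sfreq=SFREQ, epoch_sec=EPOCH_SEC):
    """Cut [n_channels, n_samples] into [n_epochs, n_picks, n_points]."""
    rows = [ch_names.index(name) for name in picks]
    data = recording[rows]
    step = int(epoch_sec * sfreq)   # 800 samples between epoch starts
    n_points = step + 1             # assumption: both endpoints kept, giving 801
    starts = range(0, data.shape[1] - n_points + 1, step)
    return np.stack([data[:, s:s + n_points] for s in starts])

# Synthetic 1 min, 64-channel recording; placeholder names fill the other 56 channels.
ch_names = SELECTED + [f"other{i}" for i in range(56)]
recording = np.random.default_rng(0).normal(size=(64, 60 * SFREQ))
epochs = split_into_epochs(recording, ch_names)
print(epochs.shape)
```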
After the instantaneous frequencies are calculated as the derivatives of the phase functions obtained by applying the Hilbert transform to the basis functions, the result can be represented in frequency-time form. Given the Nyquist-Shannon sampling theorem and the 160 Hz sampling rate (which limit the representable frequencies to below 80 Hz), we used 60 frequency bins from 0.1 Hz to 60 Hz. The resulting spectrogram had the shape of [60 frequency bins, 801 points]. An example of the EEG signal transformation in the form of a spectrogram is shown in Figure 3. To prevent the mode mixing problem [33], we used the masked sifting method [34], implemented in the EMD Python package [35].
The spectrograms of EEG channel data that we obtained in the previous step were essentially two-dimensional maps. Their two dimensions represent fundamentally different quantities, one being frequency and the other time. Therefore, the spatial invariance that two-dimensional CNNs provide may not be suitable for our task. It is better to represent spectrograms as a set of stacked time series for different frequency bins [36]. We therefore reshaped the data to 60 time series with 801 points (Figure 4), stacked the time series over all channels (such a transform can be easily reversed in case we want to use the integrated gradients method), and applied min-max normalization over the (time series × channel) dimension. No further processing, such as noise removal or band-pass filtering, was applied. The resulting dataset shape was
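The step from empirical modes to a frequency-time map can be sketched as follows. This is a NumPy-only illustration and does not reproduce the API of the EMD package used in the study. The function names `analytic_signal` and `hilbert_spectrum` are hypothetical, two pure sinusoids stand in for real empirical modes, and linearly spaced bin edges between 0.1 Hz and 60 Hz are an assumption.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal along the last axis (discrete Hilbert transform)."""
    n = x.shape[-1]
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * h)

def hilbert_spectrum(imfs, sfreq, f_edges):
    """Map modes [n_imfs, n_samples] to a spectrogram [n_bins, n_samples]."""
    z = analytic_signal(imfs)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z), axis=-1)
    # Instantaneous frequency is the time derivative of the phase, in Hz.
    inst_freq = np.gradient(phase, axis=-1) * sfreq / (2 * np.pi)
    n_bins = len(f_edges) - 1
    bin_idx = np.digitize(inst_freq, f_edges) - 1
    spec = np.zeros((n_bins, imfs.shape[-1]))
    for m in range(imfs.shape[0]):
        t = np.nonzero((bin_idx[m] >= 0) & (bin_idx[m] < n_bins))[0]
        np.add.at(spec, (bin_idx[m, t], t), amplitude[m, t])
    return spec, inst_freq

sfreq = 160
t = np.arange(801) / sfreq
# Stand-ins for empirical modes: a 10 Hz and a 30 Hz oscillation.
imfs = np.stack([np.sin(2 * np.pi * 10 * t), 0.5 * np.sin(2 * np.pi * 30 * t)])
f_edges = np.linspace(0.1, 60.0, 61)  # 60 bins; linear spacing is an assumption
spec, inst_freq = hilbert_spectrum(imfs, sfreq, f_edges)
```

Each time point contributes the amplitude of every mode to the bin that contains that mode's instantaneous frequency, which is what gives the representation its sharp time-frequency localization.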
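The stacking and normalization described in this paragraph can be sketched as below. The text does not fully specify the normalization axes, so this sketch assumes one minimum and one maximum per epoch, taken over all stacked series and time points. The function names and the random input are hypothetical.

```python
import numpy as np

def stack_and_normalize(spectrograms, eps=1e-12):
    """[n_epochs, n_channels, n_bins, n_times] -> [n_epochs, n_channels * n_bins, n_times]."""
    n_epochs, n_channels, n_bins, n_times = spectrograms.shape
    stacked = spectrograms.reshape(n_epochs, n_channels * n_bins, n_times)
    # Assumption: min-max scaling uses one minimum and maximum per epoch.
    lo = stacked.min(axis=(1, 2), keepdims=True)
    hi = stacked.max(axis=(1, 2), keepdims=True)
    return (stacked - lo) / (hi - lo + eps)

def unstack(stacked, n_channels, n_bins):
    """Reverse the stacking, e.g. to view importance maps per channel."""
    return stacked.reshape(stacked.shape[0], n_channels, n_bins, stacked.shape[-1])

rng = np.random.default_rng(0)
spectrograms = rng.random(size=(4, 8, 60, 801))  # hypothetical batch of 4 epochs
stacked = stack_and_normalize(spectrograms)
restored = unstack(stacked, n_channels=8, n_bins=60)
```

With 8 channels and 60 frequency bins, each epoch becomes 480 one-dimensional series of 801 points, which is the layout a one-dimensional convolution expects.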

Deep Learning Methods
One-dimensional dilated convolutions can be successfully utilized to classify time series and are more computationally efficient than LSTM blocks [37]. We propose the multichannel dilated one-dimensional convolutional net architecture described in Table 1 to generate feature vectors from the data. We used metric learning methods to map the data to an embedding space, where similar data are close together and dissimilar data are far apart [38]. In general, this can be achieved using specific embedding and classification losses such as the triplet loss [39], ArcFace Loss [40] or multi-similarity loss [41]. In this work, we used multi-similarity loss and the metric-learning framework [38] implemented in PyTorch.
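To make the objective concrete, the following is a minimal NumPy sketch of the multi-similarity loss from [41], without the pair-mining step of the original formulation. It is not the PyTorch implementation used in the study; the function name and the hyperparameter values `alpha`, `beta`, and `lam` are illustrative assumptions.

```python
import numpy as np

def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Multi-similarity loss on cosine similarities (pair mining omitted)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)  # same subject, not self
    neg_mask = ~same                                    # different subject
    # Positive pairs are penalized when their similarity falls below lam.
    pos_term = np.log1p(np.sum(np.exp(-alpha * (sim - lam)) * pos_mask, axis=1)) / alpha
    # Negative pairs are penalized when their similarity rises above lam.
    neg_term = np.log1p(np.sum(np.exp(beta * (sim - lam)) * neg_mask, axis=1)) / beta
    return float(np.mean(pos_term + neg_term))

# Two tight, orthogonal clusters of hypothetical embeddings.
rng = np.random.default_rng(0)
centers = np.eye(2, 16)
emb = np.repeat(centers, 5, axis=0) + 0.01 * rng.normal(size=(10, 16))
loss_good = multi_similarity_loss(emb, np.repeat([0, 1], 5))  # labels match clusters
loss_bad = multi_similarity_loss(emb, np.tile([0, 1], 5))     # labels cut across clusters
```

The loss is lower when the labels agree with the cluster structure, which is the behavior that pulls embeddings of the same subject together and pushes different subjects apart.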
The first convolution layer uses padding in such a way that the input data shape is preserved (except for the channel dimension) to correctly process the edge values. We also used the Parametric Rectified Linear Unit (PReLU) as the activation function, because [42] showed that it can outperform the Rectified Linear Unit (ReLU).
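The effect of dilation and shape-preserving padding can be illustrated with a single-channel NumPy sketch. The kernel values, the dilation rate, and the PReLU slope below are hypothetical and are not taken from Table 1. In PReLU the slope is a learned parameter, so the fixed value here is only for illustration.

```python
import numpy as np

def dilated_conv1d_same(x, kernel, dilation=1):
    """Cross-correlate x with an odd-length kernel; zero padding keeps the length."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    padded = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for j, w in enumerate(kernel):
        # Kernel taps are spaced `dilation` samples apart.
        out += w * padded[j * dilation: j * dilation + len(x)]
    return out

def prelu(x, slope=0.25):
    """PReLU with a fixed slope for negative inputs (learned in a real network)."""
    return np.where(x >= 0, x, slope * x)

x = np.arange(10, dtype=float)
y = dilated_conv1d_same(x, kernel=[1.0, 0.0, 1.0], dilation=2)
activated = prelu(np.array([-2.0, 3.0]))
```

A kernel of size k with dilation d covers d(k − 1) + 1 input samples, so stacking layers with growing dilation widens the receptive field without adding parameters.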

Model Interpretation
Improving the interpretability of deep models is a critical task for machine learning. One method for solving this problem is to identify the portions of the input data that contribute most to the final model output. However, existing approaches have several drawbacks, such as poor sensitivity and dependence on the specific implementation of the model. Reference [28] discussed two axioms, sensitivity and implementation invariance, which its authors believe a good interpretation method must satisfy.
The sensitivity axiom means that if two images differ by exactly one pixel (but they have all other pixels in common) and give different predictions, the interpretation algorithm should give a non-zero attribution to that pixel. The axiom of implementation invariance means that the basic implementation of the algorithm should not affect the result of the interpretation method. Researchers have used these principles to develop a new attribution method called integrated gradients.
IG starts with a base image (usually a completely darkened version of the input image) that increases in brightness until the original image is restored. Gradients of the class scores with respect to the input pixels are computed for each intermediate image and averaged to obtain a global importance value for each pixel. Besides its theoretical properties, IG thus also solves another problem with vanilla gradients: saturated gradients. Since the gradients are local, they do not reflect the global importance of pixels, but only the sensitivity at a particular input point. By changing the image brightness and calculating gradients at different points, IG can obtain a more complete picture of the importance of each pixel. In our work, we used the PyTorch-based Captum [43] framework implementation of integrated gradients and call the output of the integrated gradients an importance map. The block diagram featuring all output steps is shown in Figure 5.
Figure 5. The proposed method framework.
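The averaging procedure described above can be sketched on a toy model with an analytic gradient. This is not the Captum implementation used in the study; the model, its weights, and the function names are hypothetical. Following [28], the averaged gradient is multiplied by the difference between the input and the baseline.

```python
import numpy as np

W = np.array([2.0, -1.0, 0.5])  # weights of a toy model (hypothetical)

def model(x):
    """Toy saturating model: tanh of a weighted sum."""
    return np.tanh(W @ x)

def model_grad(x):
    """Analytic gradient of the toy model with respect to its input."""
    return (1.0 - np.tanh(W @ x) ** 2) * W

def integrated_gradients(grad_fn, x, baseline, n_steps=300):
    """Average gradients along the straight path from baseline to x (midpoint rule)."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([3.0, 1.0, 2.0])
baseline = np.zeros_like(x)  # the "completely darkened" input
local_gradient = model_grad(x)  # nearly zero here: the model is saturated at x
attributions = integrated_gradients(model_grad, x, baseline)
completeness_gap = abs(attributions.sum() - (model(x) - model(baseline)))
```

At this input the local gradient is almost zero because the tanh is saturated, yet the integrated gradients remain informative, and their sum approximately equals the change in model output between the baseline and the input.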

Model Training
To test the architecture's performance, we used the leave-k-groups-out validation methodology (the data of multiple users are not presented in the training set and are present only in the testing set). GroupKFold (with k = 5) from the scikit-learn package [44] was used as an iterator variant with non-overlapping groups. The same group would not appear in two different CV testing sets/folds (the number of distinct groups has to be at least equal to the number of folds). The folds were approximately balanced (the number of distinct groups was approximately the same in each fold). The data of 22 subjects (21 in the last fold) appeared only in the test fold during each CV iteration. In each epoch, 10 data samples per class were randomly selected from the training fold to form batches. For model training, we used the Adam optimizer (lr = 1 × 10^-4, weight_decay = 1 × 10^-3, 500 epochs).
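The subject-disjoint splitting and per-class sampling described above can be sketched in NumPy. This is a stand-in for scikit-learn's GroupKFold, not a reproduction of it: subjects are shuffled and split into five nearly equal parts. The function names and the figure of 22 epochs per subject are hypothetical.

```python
import numpy as np

def group_folds(groups, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) so that no subject appears on both sides."""
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(np.unique(groups))
    for test_subjects in np.array_split(subjects, n_folds):
        test_mask = np.isin(groups, test_subjects)
        yield np.nonzero(~test_mask)[0], np.nonzero(test_mask)[0]

def sample_per_class(labels, m=10, rng=None):
    """Pick m random sample indices for every class present in labels."""
    if rng is None:
        rng = np.random.default_rng()
    picks = [rng.choice(np.nonzero(labels == c)[0], size=m, replace=False)
             for c in np.unique(labels)]
    return np.concatenate(picks)

groups = np.repeat(np.arange(109), 22)  # subject ID per epoch; 22 per subject is hypothetical
folds = list(group_folds(groups))
test_subject_counts = [len(np.unique(groups[test])) for _, test in folds]

train_idx, _ = folds[0]
batch_idx = sample_per_class(groups[train_idx], m=10, rng=np.random.default_rng(1))
```

With 109 subjects and five folds, four test folds hold 22 subjects and one holds 21, matching the counts reported above.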
After training, we generated 128-unit l2-normalized feature vector representations of the input data and computed the cosine distance matrix for the generated representations. After this, the scikit-learn [44] classifier CalibratedClassifierCV (using LinearSVC as a base estimator) was used to calculate the confusion matrix over different distance thresholds. In this way, we could obtain the Equal Error Rate (EER), a metric widely used in state-of-the-art EEG-based verification systems [45]. The EER is the point on a Detection Error Tradeoff (DET) curve where the false acceptance rate and false rejection rate are equal. In general, the lower the EER, the higher the accuracy of the biometric system. The obtained EER value was 14.63%. The feature space with training fold samples is visualized in Figure 6 using the t-SNE method [46].
The hardware used in this study consisted of one Nvidia Tesla T4 GPU card (320 Turing Tensor cores, 2560 CUDA cores, and 16 GB of GDDR6 VRAM), one 8-core CPU, and 64 GB of RAM. The DNN model was trained using the GPU implementation of PyTorch, while all other processes used the CPU. The Python programming language was used for the present study. In addition to the libraries already mentioned, we also used Keras [47], NumPy [48], and Matplotlib [49].
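The relationship between the distance matrix and the EER can be illustrated with a direct threshold sweep. This sketch deliberately swaps the calibrated-classifier procedure described above for a simpler technique: it evaluates the false acceptance and false rejection rates at every observed distance and takes the threshold where they are closest. The function name and the toy embeddings are hypothetical.

```python
import numpy as np

def equal_error_rate(distances, labels):
    """Return (EER, threshold) from a symmetric pairwise distance matrix."""
    iu = np.triu_indices(len(labels), k=1)  # each unordered pair once
    d = distances[iu]
    genuine = (labels[:, None] == labels[None, :])[iu]
    thresholds = np.unique(d)
    # FAR: impostor pairs accepted; FRR: genuine pairs rejected.
    far = np.array([np.mean(d[~genuine] <= t) for t in thresholds])
    frr = np.array([np.mean(d[genuine] > t) for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2, thresholds[i]

# Toy l2-normalized embeddings: 3 subjects, 4 well-separated samples each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 4)
emb = np.eye(3, 128)[labels] + 0.01 * rng.normal(size=(12, 128))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
distances = 1.0 - emb @ emb.T  # cosine distance matrix
eer, threshold = equal_error_rate(distances, labels)
```

For these perfectly separable toy clusters the EER is zero; overlapping genuine and impostor distance distributions push it upward.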

Model Interpretability
After training, the integrated gradients method can be applied to the model. An example output is shown in Figure 7. In our case, the integrated gradients method output can be summed over the time dimension or the channel dimension. Figures 8 and 9 show the integrated gradients method output for spectrograms of four subjects, summed over the time dimension. Here, Channels 1-8 correspond to the (O1, O2, P3, P4, C3, C4, F3, F4) channels. It can be clearly seen which channels and frequencies were more important for the model feature vector output. Figure 8 demonstrates that there was a large variability within the same class and a small separation between two different classes (they look alike). We can additionally sum importance maps over the channel dimension to see which frequencies are more important for the model feature vector output and to distinguish the importance maps for each class more clearly (see Figures 10 and 11).
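The two summations described above reduce to axis sums once the stacked importance map is reshaped back to channels × frequency bins × time. In this sketch, a random array stands in for an actual importance map, and the ranking of bins by absolute summed importance is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
stacked_importance = rng.normal(size=(480, 801))  # hypothetical map for one epoch

# Undo the stacking: 480 series -> 8 channels x 60 frequency bins.
importance = stacked_importance.reshape(8, 60, 801)

per_channel_freq = importance.sum(axis=2)  # summed over time: [8, 60]
per_freq = per_channel_freq.sum(axis=0)    # additionally summed over channels: [60]

# Frequency bins ranked by absolute summed importance (illustrative choice).
top_bins = np.argsort(np.abs(per_freq))[::-1][:5]
```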

Discussion
The proposed architecture was tested on the publicly available PEEGMIMDB dataset and achieved a 14.63% Equal Error Rate (EER). This EER is worse than that reported in [26] (Single-Session Enrollment (SSE) and Short Time Distance (STD) with deep representations with channel-specific CNN modeling achieved an 8.1% EER and a 6.8% EER for the eyes-closed and eyes-open states, respectively; the dataset used is not publicly available), which may be attributable to the different numbers of subjects in the datasets (109 in our case vs. 50 subjects in [26]). However, the main advantage of our proposed approach is its suitability for interpretation via the created spectrograms and the integrated gradients method (we operated on spectrograms in the time-frequency domain, whereas Reference [26] operated only in the time domain). In some cases, the difference cannot be clearly seen, as in Figure 8. However, we can additionally sum importance maps over the channel dimension to see which frequencies are more important for the model feature vector output and to distinguish the importance maps for each class more clearly (see Figures 10 and 11).

Conclusions
The proposed neural network architecture treats a Hilbert spectrogram as a collection of one-dimensional series and applies one-dimensional dilated convolutions over them. A multi-similarity loss was used as the loss function for subject-independent learning. The architecture was tested on the publicly available PEEGMIMDB dataset and achieved a 14.63% Equal Error Rate (EER). The main advantage of our proposed approach is its suitability for interpretation via the created spectrograms and the integrated gradients method (we operated on spectrograms in the time-frequency domain, whereas Reference [26] operated only in the time domain). Future work will focus on using the Hilbert holospectrum to improve system accuracy.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data are available in a publicly accessible repository. The data presented in this study are openly available in the PhysioNet repository at DOI: 10.13026/C28G6P, reference number [50].