1. Introduction
A brain-computer interface (BCI) system is an emerging technology that offers the human brain a new way to control external devices without relying on the peripheral neural pathways [1]. People can control external devices by imagining a relevant action. Currently, BCI-based rehabilitation therapy focuses mainly on the recognition of motor imagery EEG (MI-EEG) [2,3,4], which is collected while a subject performs a specific motion imagination. However, the high dimension of the MI-EEG signal leads to a high computational cost, which usually obstructs the implementation of on-line recognition algorithms and adversely affects the classification accuracy of MI-EEG. Therefore, how to effectively extract the low-dimensional representation hidden in a high-dimensional dataset has been a hotspot in the BCI and machine learning fields in recent years [5,6,7]. To address the computational complexity and data storage problems caused by high-dimensional signals, many dimensionality reduction methods have been used in traditional BCI technology, such as principal component analysis (PCA), independent component analysis (ICA) and multidimensional scaling (MDS). PCA assumes that the samples are uncorrelated, and its main idea is to calculate a group of new features, arranged in descending order of importance, as linear combinations of the original ones [8]; ICA is similar to PCA, except that the components are designed to be statistically independent of each other; and MDS is an unsupervised dimensionality reduction method that seeks the low-dimensional embedding that best preserves the pairwise distances between the original data points [9]. All of these methods have been widely used in visualization and feature extraction, for they are well understood and easy to implement. Unfortunately, they share an inherent limitation: they are all linear methods, whereas MI-EEG, as an output of the brain, contains much nonlinear geometric information [10]. The features extracted by these methods are therefore neither adequate nor effective [11], and the classification accuracy also suffers. Manifold learning theory provides a new way to solve this problem.
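As a point of reference for the linear baselines discussed above, a minimal PCA sketch follows (assuming NumPy; the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def pca(X, d):
    # Center the data, then project onto the top-d eigenvectors of the
    # sample covariance matrix (the directions of largest variance).
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]    # keep the d largest
    return Xc @ eigvecs[:, order]
```

Being a linear projection, this preserves only global variance structure, which motivates the nonlinear manifold methods introduced next.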
In 2000, two nonlinear dimensionality reduction methods, the isomap and locally linear embedding (LLE) algorithms, were proposed by Tenenbaum et al. [12] and Roweis et al. [13]. Both were published in Science and created a new branch of machine learning known as manifold learning (ML). ML assumes that high-dimensional data observed in the real world are usually generated by a relatively small number of degrees of freedom [12] and can therefore be embedded in a lower dimensional space according to the geometry they preserve [14]. Since then, ML has begun to emerge in the BCI field. Sadatnejad et al. [15] proposed a new kernel function for the symmetric positive definite (SPD) matrix of the MI-EEG signal and applied the isomap algorithm to reduce the dimension of the SPD matrix; a KNN classifier was used to classify the features, and the experiments yielded promising results compared with the most popular manifold learning methods as well as the common spatial pattern (CSP) technique. Mirsadeghi et al. [16] calculated the power, covariance and a series of entropies of the EEG signal to obtain its original features, and the LLE algorithm was used to reduce the feature dimension; the depth of anesthesia was estimated by quadratic discriminant analysis, and the average classification rate was 88.4%. Krivov et al. [10], from the Kharkevich Institute, proposed a method that reduces the dimension of the covariance matrix of the MI-EEG signal with the isomap algorithm and introduced concepts from Riemannian geometry; the useful information hidden in MI-EEG was explored from the perspective of spatial structure, and the classification accuracy was as good as that of the state-of-the-art CSP algorithm. These manifold learning methods have achieved impressive results on MI-EEG. However, while acquiring low-dimensional embedding representations, they cannot obtain an explicit mapping from the high-dimensional space to the low-dimensional space. Therefore, it is impossible to directly extract the features of a new sample when carrying out a pattern recognition task. To solve this problem and enhance the separability of multi-class datasets, Li et al. [14,17] extended the original isomap and proposed the supervised explicit isomap (SE-isomap) algorithm, in which the geodesic distance matrix is calculated with respect to the class label information and MDS with an explicit transformation is adopted instead of the classical MDS used in isomap. Regrettably, SE-isomap has not yet been applied to the feature extraction of MI-EEG signals.
As we know, MI-EEG is non-stationary and time varying, and it has a clear time-frequency distribution. Both the wavelet transform (WT) and wavelet packet decomposition (WPD) are common and effective time-frequency analysis methods [18,19,20]. WPD, as an extension of WT, decomposes a signal into its high-frequency and low-frequency components simultaneously; thus, the range of time-frequency analysis is broadened, and more of the time-frequency domain information embedded in a signal is obtained. Additionally, WPD has been proven able to capture the dynamic characteristics of a signal effectively [21]. Consequently, WPD has attracted increasing attention in the BCI field in recent years. Yang et al. [19] applied WPD to extract the time-frequency features of MI-EEG, and the wavelet packet coefficients were input to the CSP algorithm to obtain a set of six-dimensional eigenvectors; a probabilistic neural network (PNN) was then used to classify the features, and the average recognition accuracy was 88.66%. Luo et al. [22] extracted the MI-EEG time-frequency information by using WPD and then applied the dynamic frequency feature selection (DFFS) algorithm to select the sub-WPD coefficients associated with the imagery task; the random forests algorithm was used to evaluate the features, and the average classification accuracy over nine different subjects reached 84.06%. It is obvious that WPD plays an important role in the feature extraction of MI-EEG. However, the frequency range of each wavelet packet subspace (WPS) is relatively fixed, and the above WPD-based methods select the same WPSs as time-frequency features for different subjects, so the WPD coefficients cover the same frequency ranges for every subject. This is unfavorable for the extraction of subject-specific features and results in poor adaptability of WPD-based feature extraction methods. In this paper, an adaptive feature extraction method is proposed for MI-EEG based on WPD and SE-isomap. The optimal wavelet packets (OWPs), reflecting the subject-based features, are selected autonomously, and the coefficients of the OWPs are used to compute statistical time-frequency features. In the meantime, SE-isomap is applied to obtain the nonlinear manifold structure features, as well as the explicit nonlinear mapping. The combined features show excellent performance in both classification accuracy and computational efficiency.
The rest of the paper is organized as follows: in Section 2, the basic principles of WPD, the SE-isomap algorithm and explicit-MDS are briefly introduced; Section 3 describes the feature extraction method based on WPD and the SE-isomap algorithm in detail; in the following section, extensive experiments are conducted on a publicly available dataset; Section 5 concludes the paper.
2. Primary Theory
2.1. Wavelet Packet Decomposition
In WT, the sampled signal is passed through a low-pass and a high-pass filter, and each filter output is considered a subspace; however, only the low-pass result is taken as the input of the next step, yielding another layer of low-pass and high-pass filtered results [23]. Different from WT, WPD further decomposes both the low-pass and the high-pass results; therefore, the wavelet packet transform can be regarded as a complete binary tree whose root node is the original signal and whose sub-nodes at each layer are the filtering results of the layer above [20,24]. A three-level sub-band tree is shown in Figure 1, where s(0,0) represents the initial space of the dataset, s(j,n) denotes a decomposed WPS, j represents the decomposition level and n represents the index of the subspace at the j-th level. The k-th WPD coefficient of the n-th wavelet packet at the j-th level can be expressed as:

$d_j^{2n}(k) = \sum_m h(m-2k)\, d_{j-1}^{n}(m), \qquad d_j^{2n+1}(k) = \sum_m g(m-2k)\, d_{j-1}^{n}(m)$

where $h(\cdot)$ and $g(\cdot)$ are a pair of quadrature mirror filters that do not depend on the level, and the relationship between them is:

$g(k) = (-1)^{k}\, h(1-k)$

After that, the decomposition coefficients of the j-th layer can be obtained by iteratively computing them from the coefficients of the (j − 1)-th layer. The frequency range corresponding to the subspace s(j,n) is $[\, n f_s/2^{j+1},\ (n+1) f_s/2^{j+1}\,]$, where $f_s$ represents the sampling frequency.
The energy of a signal is distributed over the time-frequency domain by WPD, and it is reflected by the values of the WPD coefficients. Owing to this property, WPD is superior to other methods in dealing with typical non-stationary signals such as MI-EEG [25].
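A minimal sketch of the complete binary decomposition tree described above, assuming the Haar filter pair for simplicity (the paper does not specify the wavelet basis at this point):

```python
import numpy as np

# Haar quadrature mirror pair (an illustrative assumption).
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass

def wpd_level(subspaces):
    """Split every subspace of one level into its low/high children."""
    children = []
    for c in subspaces:
        low = np.convolve(c, h[::-1])[1::2]    # filter, then downsample by 2
        high = np.convolve(c, g[::-1])[1::2]
        children.extend([low, high])
    return children

def wpd(signal, levels):
    """Full WPD tree: returns the 2**levels subspaces s(j,0)..s(j,2**j-1)."""
    tree = [np.asarray(signal, dtype=float)]
    for _ in range(levels):
        tree = wpd_level(tree)
    return tree
```

Because the Haar pair is orthogonal, the sum of squared coefficients over all subspaces of a level equals the energy of the original signal, which is the energy conservation property used later for wavelet packet selection.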
2.2. SE-Isomap Algorithm
As an extension of the original isomap algorithm, SE-isomap [17] effectively exploits the label information of the training samples to aggregate samples belonging to the same class and properly disperse the others; the separability of the samples is thus enhanced. Moreover, an explicit mapping of the samples from the high-dimensional space to the target space is obtained, which significantly shortens the feature extraction time. The basic principle is as follows:
Suppose that the dataset X contains two classes of samples, where $N_c$ denotes the number of samples in class c (c = 1, 2) and D denotes the dimension of the observation space in which the dataset lies. Assuming that the intrinsic dimension of the manifold embedded in the dataset is d, the SE-isomap algorithm usually consists of three steps:
- (1)
Constructing the intra-class distance matrix: Calculate the k-nearest neighbors of each sample $x_i^{(c)}$ within its class c. A neighborhood graph $G_c$ is then constructed with the samples as vertices and the Euclidean distances between neighboring samples as edge weights. The shortest path distance between two vertices in the neighborhood graph is regarded as an approximation of the geodesic distance between the corresponding samples; it can be computed by iterating the shortest-path update

$d_G(x_i, x_j) = \min\{\, d_G(x_i, x_j),\ d_G(x_i, x_k) + d_G(x_k, x_j) \,\}$

over all intermediate vertices $x_k$. Then, the intra-class geodesic distance matrix is constructed from the approximate geodesic distances between all pairs of points in class c.
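The intra-class geodesic approximation above (k-NN graph plus shortest paths) can be sketched as follows, using SciPy's shortest-path routine; the function name and the choice of Dijkstra's algorithm are illustrative assumptions:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def intra_class_geodesic(X, k):
    """Approximate pairwise geodesic distances within one class."""
    d = cdist(X, X)                       # pairwise Euclidean distances
    n = len(X)
    graph = np.zeros((n, n))              # zeros are treated as non-edges
    neighbours = np.argsort(d, axis=1)[:, 1:k + 1]   # skip self at column 0
    for i in range(n):
        graph[i, neighbours[i]] = d[i, neighbours[i]]
    # Shortest-path distances on the undirected k-NN graph approximate
    # geodesic distances along the data manifold (the isomap idea).
    return shortest_path(graph, method='D', directed=False)
```

For points sampled along a curved manifold, these shortest-path lengths follow the manifold instead of cutting across it, which is exactly what distinguishes the geodesic matrix from the raw Euclidean one.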
- (2)
Constructing the global discriminative distance matrix: Calculate the inter-class geodesic distance between any two samples belonging to different classes; the computation involves the shortest Euclidean distance between the two classes and the intra-class geodesic distances from each sample to the closest sample pair. In general, the strategy for calculating the inter-class geodesic distance differs across experimental tasks: Equation (5) is chosen for visualization to preserve the authenticity of the structure of the dataset, whereas Equation (6) is used when a classification experiment is carried out to enhance separability. The inter-class distance parameter balances the fidelity of visualization against the separability of the data, and the inter-class distances form the inter-class geodesic distance matrix. Finally, the global geodesic distance matrix G, representing the distance between any pair of sample points, is constructed as a block matrix: its diagonal blocks are the intra-class geodesic distance matrices of the two classes, multiplied by a scale parameter smaller than one to properly reduce the intra-class distances and reinforce the gathering effect, and its off-diagonal blocks are the inter-class geodesic distance matrices between samples of different classes.
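The block structure of the global matrix G described above can be illustrated as follows; the function name and the default value of the shrinking parameter are hypothetical:

```python
import numpy as np

def global_distance_matrix(D1, D2, D12, scale=0.5):
    """Assemble G from intra-class blocks D1, D2 and inter-class block D12.
    `scale` < 1 shrinks the diagonal (intra-class) blocks to reinforce the
    gathering effect within each class."""
    top = np.hstack([scale * D1, D12])
    bottom = np.hstack([D12.T, scale * D2])
    return np.vstack([top, bottom])
```

Scaling only the diagonal blocks tightens each class relative to the inter-class distances, which is what increases separability in the embedded space.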
- (3)
Utilizing the explicit-MDS algorithm to obtain the low-dimensional embedding and the corresponding mapping of the dataset: The explicit-MDS algorithm is applied to the global distance matrix to obtain the low-dimensional expression of the samples and the explicit mapping matrix W.
2.3. Explicit-MDS
Utilizing the explicit-MDS algorithm on the global geodesic distance matrix G, the corresponding explicit mapping can be obtained while the low-dimensional representation of the data is computed. For any sample x, the embedding takes the form

$y = W^{\mathrm{T}} \phi(x)$

where W denotes the weight coefficient matrix, $\phi(x)$ is the column vector of basis function values, y is the optimal low-dimensional expression of the original data in the target space and $\{\phi_m\}$ is a set of nonlinear basis functions. Here, radial basis functions (RBFs) are selected to construct the $\phi_m$:

$\phi_m(x) = \exp\!\left(-\frac{\lVert x - c_m \rVert^2}{2\sigma^2}\right)$

where $c_m$ denotes the centers of the RBFs and $\sigma$ denotes the bandwidth parameter, determined by the mean distance of the dataset. Correspondingly, the objective function is constructed as the sum of squared differences between the entries of G and the pairwise Euclidean distances between the constructed points in the target space after dimensionality reduction. The problem is thus transformed into finding the optimal weight matrix W that minimizes the objective function. An iterative optimization algorithm is used to solve the resulting least squares problem on W. The algorithm steps are as follows:
Step 1. Let t = 0, and initialize the weight matrix $W^{(0)}$;
Step 2. Compute the auxiliary matrix V from the current embedding and the target distances in G;
Step 3. Update $W^{(t+1)}$ as the solution of the least squares minimization problem; it can be written in closed form by means of $A^{+}$, the Moore–Penrose inverse of the design matrix A;
Step 4. Check for convergence. If the algorithm has not converged, let t = t + 1 and go to Step 2 to continue the iteration; otherwise, stop.
Finally, the optimal expression of the original data in the target space and the corresponding mapping weight matrix W are obtained. The nonlinear manifold structure feature of a new test sample can then be computed directly by multiplying the explicit mapping matrix by the sample's basis function vector, i.e., $y = W^{\mathrm{T}} \phi(x)$, which greatly improves the efficiency of the algorithm.
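The explicit out-of-sample mapping described above can be sketched as follows, assuming W has one row per RBF center and one column per target dimension (a layout convention chosen here for illustration):

```python
import numpy as np

def rbf_vector(x, centers, sigma):
    """phi(x): one RBF value per learned center."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def embed_new_sample(x, W, centers, sigma):
    """y = W^T phi(x): the explicit mapping applied to a new sample,
    with no re-run of the embedding algorithm."""
    return W.T @ rbf_vector(x, centers, sigma)
```

This is the key practical advantage over classical isomap: once W, the centers and the bandwidth are fixed on the training set, embedding a test sample costs only one matrix-vector product.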
3. Feature Extraction Method Based on WPD and SE-Isomap
In this section, the feature extraction method is developed with WPD and SE-isomap; it mainly includes three aspects: effective time segment selection, optimal wavelet packet selection, and feature extraction and serial fusion. The flow chart of the proposed method is shown in Figure 2, in which a KNN classifier is used to evaluate the effectiveness of the MI-EEG features.
3.1. Instantaneous Power Spectra Analysis
In this study, the instantaneous power of each signal is calculated, and the instantaneous powers of all signals are averaged to obtain the average instantaneous power spectrum. Suppose that the dataset is $X = \{x_i\}_{i=1}^{N}$, where N denotes the number of samples, $x_{ij}$ is the j-th point of the i-th sample and D denotes the original dimension of the samples. Then, the average power of the j-th point over the sample set can be calculated by:

$\bar{P}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}^{2}$
By comparing the average power spectrum over the entire time period, the time period in which the event-related synchronization/desynchronization (ERS/ERD) phenomenon [22] is most obvious is determined, and this specific time interval is taken as the dataset for the subsequent analysis.
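A minimal sketch of the average instantaneous power computation, assuming the instantaneous power of a point is its squared amplitude as in the equation above:

```python
import numpy as np

def average_instantaneous_power(X):
    """X: (N samples, D time points). The instantaneous power of sample i
    at time j is x_ij**2; averaging over the N samples gives the average
    instantaneous power spectrum."""
    X = np.asarray(X, dtype=float)
    return np.mean(X ** 2, axis=0)
```

Plotting this D-point curve is what reveals the time segment where the ERS/ERD effect is strongest.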
3.2. Selection of Optimal Wavelet Packets
The frequency bands and associated powers of the MI-EEG signal, which is collected over the cortical regions related to motion, may change with different imagined actions and different subjects. The recognition of MI-EEG is primarily based on the ERS/ERD-related motion-sensing rhythm characteristics. Therefore, the wavelet packet transform is the best choice for decomposing the MI-EEG segment pre-determined by the power spectra analysis, and the wavelet packet variance is used to select the subject-based frequency bands, i.e., the subject-specific wavelet packets. Suppose that the WPD coefficients corresponding to wavelet packet s(j,n) for the sample $x_i$ are denoted as $d_{j,n}^{l}(k)$, where l = 3 or 4 is the label of channel C3 or C4, j is the number of decomposition levels, n is the index of the wavelet packet subspace at the j-th level and k is the index of the coefficient sequence. According to wavelet packet structure theory, if the wavelet basis function is orthogonal, WPD complies with the energy conservation principle [21]; that is, the square sum of the WPD coefficients on each layer does not change with the number of decomposition layers. It can be formulated as:
$\sum_{n=0}^{2^{j}-1} \sum_{k=1}^{K} \left| d_{j,n}(k) \right|^{2} = E$

where K is the number of wavelet packet coefficients for subspace s(j,n), $|d_{j,n}(k)|^{2}$ represents the energy density and E represents the energy of the wavelet packet decomposition. The energy of the n-th wavelet packet subspace is calculated as:

$E_{j,n} = \sum_{k=1}^{K} \left| d_{j,n}(k) \right|^{2}$

The variance of the n-th WPS is defined as:

$\sigma_{j,n}^{2} = \frac{1}{K} \sum_{k=1}^{K} \left( d_{j,n}(k) - \bar{d}_{j,n} \right)^{2}$

where $\bar{d}_{j,n}$ is the mean of the coefficients of subspace s(j,n).
Then, the variance of different wavelet packets at a specific level is obtained.
After the j-level WPD of the MI-EEG signal $x_i$, the WPD coefficients are divided into the subspaces s(j,0) ~ s(j, $2^{j}-1$).
Then, according to Equation (17), the variance of the wavelet packet coefficients is calculated for each subspace at the specific level j, and the wavelet packet variances of each subspace on this level are averaged over all samples. The mean variance is obtained as follows:

$\bar{\sigma}_{j,n}^{2} = \frac{1}{N} \sum_{i=1}^{N} \sigma_{j,n}^{2}(i)$

where $\bar{\sigma}_{j,n}^{2}$ denotes the mean wavelet packet variance over all samples, N represents the number of samples and K is the dimension of the wavelet packet coefficients.
For each motor imagery task, the j-level WPD of the MI-EEG $x_i$ (i = 1, 2, …, N) on channels C3 and C4 is first performed. After that, the mean wavelet packet variances corresponding to channels C3 and C4 are calculated for each wavelet packet s(j,n) (n = 0, 1, …, $2^{j}-1$), and the differences between the two channels' mean wavelet packet variances are computed and sorted in descending order. The wavelet packets with the top m mean variance differences are taken as the OWPs, meaning that they provide the most significant distinction for each motor imagery task. All of the OWP coefficients are concatenated into a vector indexed by the selected OWP subspaces.
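The OWP selection step can be sketched as follows; taking the absolute value of the C3 − C4 variance difference is an illustrative assumption, since the text only states that the differences are sorted in descending order:

```python
import numpy as np

def select_owp(mean_var_c3, mean_var_c4, m):
    """Rank the wavelet packet subspaces by the (absolute) difference of the
    mean coefficient variances on channels C3 and C4, and return the indexes
    of the top-m subspaces (the OWPs)."""
    diff = np.abs(np.asarray(mean_var_c3) - np.asarray(mean_var_c4))
    return np.argsort(diff)[::-1][:m]
```

Because the ranking is recomputed per subject, the selected frequency bands adapt to each subject instead of being fixed a priori.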
3.3. Feature Extraction
3.3.1. Statistical Features Based on OWP
A series of WPD coefficients is obtained after the MI-EEG signal is decomposed by WPD, and these coefficients carry important time-frequency information. Here, the OWPs with the top q mean variance differences are selected, and their coefficients are used to calculate statistical features. Let $d_{j,n}^{l}(k)$ be the decomposition coefficients of the s(j,n) subspace for the sample $x_i$ over channel Cl; then three statistics are defined: the mean of the coefficients in the frequency band corresponding to the s(j,n) subspace, the mean energy and the mean standard deviation.
The ERS/ERD phenomenon is expressed at the C3 and C4 electrodes when the brain performs the imagination of left- or right-hand movement. To obtain the maximum separability between the two motor imagery tasks, the time-frequency features are defined from these statistics over channels C3 and C4 for the selected OWPs.
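The three statistics defined above can be computed per OWP coefficient sequence as follows (a minimal sketch; the exact normalization used in the paper may differ):

```python
import numpy as np

def owp_statistics(coeffs):
    """Mean, mean energy and standard deviation of one OWP coefficient
    sequence, the three statistics used as time-frequency features."""
    c = np.asarray(coeffs, dtype=float)
    return float(np.mean(c)), float(np.mean(c ** 2)), float(np.std(c))
```

Applying this to each selected OWP on channels C3 and C4 yields the time-frequency feature vector for one trial.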
3.3.2. Nonlinear Structure Features with SE-Isomap
Suppose that the OWP coefficient vectors (i = 1, 2, …, N) have been obtained, as in Section 3.2, for each class, i.e., for each of the two motor imagery tasks. Then, the intra-class and inter-class geodesic distances between any two coefficient vectors are calculated according to Equations (3) and (4) and assembled into the corresponding intra-class and inter-class distance matrices, from which the global geodesic distance discriminant matrix G can be obtained.
Next, the explicit-MDS algorithm is applied to the global discriminant distance matrix G from the previous step to obtain the explicit mapping matrix, via the iterative optimization algorithm, together with a low-dimensional representation of the training samples. Then, the radial basis function vector of a test sample is computed according to Equation (9). Finally, the low-dimensional representation of the test sample is obtained by applying the explicit mapping to this vector. The new nonlinear feature, representing the manifold structure, has dimension d.
3.3.3. Feature Fusion
To take full advantage of the information obtained from multiple aspects, the time-frequency features and the nonlinear manifold structure features are integrated by serial concatenation. They are standardized before fusion to prevent the uneven weighting caused by the difference in magnitude between the two types of features: each feature vector is divided by its two-norm before the vectors are concatenated.
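A minimal sketch of the serial fusion with two-norm standardization described above:

```python
import numpy as np

def fuse_features(f_tf, f_manifold):
    """Divide each feature vector by its two-norm so that neither feature
    type dominates by magnitude, then concatenate them serially."""
    a = np.asarray(f_tf, dtype=float)
    b = np.asarray(f_manifold, dtype=float)
    return np.concatenate([a / np.linalg.norm(a), b / np.linalg.norm(b)])
```

After fusion, both halves of the combined vector lie on the unit sphere, so subsequent distance-based classification treats them on an equal footing.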
3.4. Feature Evaluation
In the proposed extraction method, SE-isomap, as a nonlinear dimensionality reduction technique, relies mainly on the intrinsic geometry of the dataset; the extracted features contain the intrinsic geometric information of the data, and the geodesic distances between features reflect that geometry. Therefore, the KNN algorithm with geodesic distance is the most direct and simple choice, and it is adopted for feature evaluation.
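A sketch of KNN voting with precomputed geodesic distances, assuming the distances from a test sample to all training samples are already available (e.g., from the shortest-path computation described in Section 2.2):

```python
import numpy as np

def knn_geodesic_predict(geo_dist, train_labels, k):
    """Majority vote among the k training samples with the smallest
    geodesic distance to the test sample."""
    nearest = np.argsort(geo_dist)[:k]
    votes = np.asarray(train_labels)[nearest]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```

Using geodesic rather than Euclidean distances keeps the evaluation consistent with the manifold geometry that SE-isomap was designed to preserve.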