2.1. DEAP Database
This study was performed on the publicly available DEAP database [18], a multimodal dataset for the analysis of human emotional states. A total of 32 EEG channels and eight peripheral physiological signals were recorded from 32 subjects (aged between 19 and 37) while they watched music videos. The 40 one-minute videos were carefully selected to elicit different emotional states according to the dimensional valence-arousal emotion model. This model, first proposed by Russell [19], places each emotional state on a two-dimensional scale. The first dimension is valence, ranging from negative to positive, and the second is arousal, ranging from calm to excited. In DEAP, each subject rated each video clip from 1 to 9 for arousal and valence after viewing it, and the discrete rating value can be used as a classification label in emotion recognition [20,21,22].
In this study, for simplicity, the preprocessed DEAP dataset in MATLAB format was used to test our channel selection algorithms for classifying four emotional states (joy: valence ≥ 5, arousal ≥ 5; fear: valence < 5, arousal ≥ 5; sadness: valence < 5, arousal < 5; relaxation: valence ≥ 5, arousal < 5). In the preprocessing procedure, the EEG signal was downsampled from 512 Hz to 128 Hz and a band-pass filter from 4.0 to 45.0 Hz was applied. In addition, electrooculography (EOG) artifacts were removed from the EEG signal. For each subject, the 40 one-minute trials were divided into four categories, labeled joy, fear, sadness and relaxation. In our experiment, about half of the trials in each category were randomly selected for channel selection (channel-selection dataset), and the rest were used to test the performance of the channel selection results (performance-validation dataset). To ensure that every category contained enough data for both channel selection and testing, only subjects with at least five trials in every category were considered. Thus, sixteen subjects (1, 2, 5, 7–11, 14–19, 22, 25) were chosen. Candra et al. [23] reported that the effective window size for arousal and valence recognition using the DEAP database was 3–10 s and 3–12 s, respectively. Therefore, to increase the number of samples, each 60-s trial was segmented into 15 non-overlapping 4-s samples, giving a total of 600 samples (40 trials × 15 samples) for each subject. All samples derived from the same trial share the same category label.
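The labelling and segmentation rules above can be illustrated with a minimal sketch. The function names `emotion_label` and `segment_trial` are hypothetical, and the trial is assumed to be a plain list of 32 channels, each holding exactly 60 s of data at 128 Hz; loading the actual DEAP files is not shown.

```python
def emotion_label(valence, arousal, threshold=5.0):
    """Map a (valence, arousal) rating pair to one of the four categories."""
    high_valence = valence >= threshold
    high_arousal = arousal >= threshold
    if high_valence and high_arousal:
        return "joy"
    if not high_valence and high_arousal:
        return "fear"
    if not high_valence and not high_arousal:
        return "sadness"
    return "relaxation"


def segment_trial(trial, fs=128, window_s=4):
    """Cut a trial (list of channels) into non-overlapping windows.

    Returns a list of segments; each segment is a list of channels,
    and each channel holds fs * window_s consecutive data points.
    """
    step = fs * window_s
    n_segments = len(trial[0]) // step
    return [
        [channel[i * step:(i + 1) * step] for channel in trial]
        for i in range(n_segments)
    ]


# Synthetic stand-in for one trial: 32 channels x 60 s x 128 Hz (all zeros).
trial = [[0.0] * (60 * 128) for _ in range(32)]
segments = segment_trial(trial)
label = emotion_label(valence=7.2, arousal=3.1)  # high valence, low arousal
```

Every segment cut from a trial would carry the label computed from that trial's ratings, as described above.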
2.2. Feature Extraction
Different types of features have been used in EEG-based emotion recognition, including time, frequency and time-frequency domain features [24]. However, power features from different frequency bands are still the most popular in the context of emotion recognition. In this work, a series of band-pass filters was used to decompose the raw EEG data from the 32 channels of each sample into the theta (4–8 Hz), alpha (8–13 Hz), beta (13–30 Hz) and gamma (30–45 Hz) bands. Then, the power of each frequency band was calculated using a 512-point fast Fourier transform (FFT), so 128 features (32 channels × four bands) were obtained for each sample.
The power of a specific frequency band corresponding to channel $T$ in sample $R$ is calculated as
$$P(T, R) = \frac{1}{N} \sum_{k=1}^{N} \left| X_k(T, R) \right|^2 \qquad (1)$$
where $X_k(T, R)$ is the FFT of the EEG signal for channel $T$ in sample $R$, and $N$ is the length of the FFT, which equals the sample length of 512 points (4 s). Z-score normalization was adopted for each feature [25]. For a feature $f_R$ belonging to sample $R$, the normalized value was computed by
$$\tilde{f}_R = \frac{f_R - \mu_f}{\sigma_f} \qquad (2)$$
where $\mu_f$ and $\sigma_f$ are, respectively, the mean and the standard deviation of feature $f$ over all samples.
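As a rough illustration of the two steps above, the sketch below computes a spectral power and a z-score using only the Python standard library. Several points are assumptions rather than the authors' implementation: the power is taken as the mean squared magnitude of the transform, a naive DFT stands in for an FFT routine (in practice a call such as `numpy.fft.fft` would be the usual choice), the band-pass filtering step is omitted, and the population standard deviation is used for normalization.

```python
import cmath
import math
import statistics


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    return [
        sum(x[t] * twiddle[(k * t) % n] for t in range(n))
        for k in range(n)
    ]


def band_power(x):
    """Mean squared DFT magnitude, (1/N) * sum_k |X_k|^2 (assumed form)."""
    spectrum = dft(x)
    return sum(abs(c) ** 2 for c in spectrum) / len(x)


def zscore(values):
    """Normalize one feature over all samples: (f - mean) / std."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std: an assumption
    return [(v - mu) / sigma for v in values]


FS = 128   # sampling rate in Hz
N = 512    # 4-s sample length in points

# Synthetic alpha-band test tone: a unit-amplitude 10 Hz sine wave.
alpha_tone = [math.sin(2 * math.pi * 10 * t / FS) for t in range(N)]
power = band_power(alpha_tone)

# Z-score a toy feature observed in four samples.
normalized = zscore([1.0, 2.0, 3.0, 4.0])
```

With this normalization, Parseval's theorem makes the result equal to the sum of squared signal values, which gives a simple sanity check on the computation.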
2.3. Channel Selection Based on ReliefF
ReliefF is a widely used feature selection method for classification problems because it is effective and computationally simple. The key idea of ReliefF is to evaluate the quality of features according to their ability to discriminate between samples that are near to each other. This ability is quantified as a weight for every feature. Channel selection can then be performed based on the results of ReliefF feature selection, as in the typical feature-selection-based channel selection method mentioned earlier.
In our work, 128 features were used to discriminate four emotional states (joy, fear, sadness, relaxation) on the channel-selection dataset (about 300 samples per subject). First, the ReliefF algorithm was used to rank all 128 features according to their weights. Initially, the weights of all 128 features were set to zero. For each sample $R_i$, the $k$ nearest neighbors from the same class as $R_i$ (nearest hits $H_j$) and the $k$ nearest neighbors from each of the other classes (nearest misses $M_j(C)$) were found using the Euclidean distance between samples in feature space. Then, the weights $W(F)$ of all features were updated according to Equation (3). This process was repeated until every sample in the channel-selection dataset had been chosen, and the final $W(F)$ was our estimate of the quality of the 128 features. The whole process is implemented with the following steps:
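set all the weights of 128 features to zero, W(F) = 0.0;
for i = 1 to m (the number of samples in the channel-selection dataset) do
    select a sample Ri;
    find k nearest hits Hj (j = 1, 2, …, k);
    for each class C ≠ class(Ri) do
        find k nearest misses Mj(C) (j = 1, 2, …, k) from class C;
    end;
    for F = 1 to 128 do
        W(F) = W(F) − Σ_{j=1..k} diff(F, Ri, Hj) / (m·k) + Σ_{C ≠ class(Ri)} [ P(C) / (1 − P(class(Ri))) · Σ_{j=1..k} diff(F, Ri, Mj(C)) ] / (m·k);   (3)
    end;
end;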
where the number of nearest neighbors $k$ is 10, which is safe for most purposes [17], and $P(C)$ is the prior probability of class $C$ (estimated from the channel-selection dataset). The function $\mathrm{diff}(f, R_1, R_2)$ calculates the difference between the values of feature $f$ for two samples $R_1$ and $R_2$, and is defined as
$$\mathrm{diff}(f, R_1, R_2) = \frac{\left| \mathrm{value}(f, R_1) - \mathrm{value}(f, R_2) \right|}{\max(f) - \min(f)} \qquad (4)$$
where $\mathrm{value}(f, R)$ is the value of feature $f$ in sample $R$, and $\max(f)$ and $\min(f)$ are, respectively, the maximum and minimum of feature $f$ over all samples.
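The procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the weight update follows the commonly published form of ReliefF (hits subtracted, misses added with weight P(C) / (1 − P(class(Ri))), both divided by m·k), the name `relieff` and the toy data are hypothetical, and when a class has fewer than k other members all of them are used while the normalization still divides by k.

```python
import math
from collections import Counter


def relieff(X, y, k=10):
    """Return one ReliefF weight per feature for samples X with labels y."""
    m, n_features = len(X), len(X[0])
    lows = [min(row[f] for row in X) for f in range(n_features)]
    highs = [max(row[f] for row in X) for f in range(n_features)]
    # Range of each feature; fall back to 1.0 for constant features.
    spans = [(highs[f] - lows[f]) or 1.0 for f in range(n_features)]
    prior = {c: count / m for c, count in Counter(y).items()}

    def diff(f, a, b):
        # Range-normalized absolute difference of feature f.
        return abs(a[f] - b[f]) / spans[f]

    def nearest(i, cls):
        # Indices of the k samples of class cls closest to sample i.
        candidates = [j for j in range(m) if j != i and y[j] == cls]
        candidates.sort(key=lambda j: math.dist(X[i], X[j]))
        return candidates[:k]

    weights = [0.0] * n_features
    for i in range(m):
        hits = nearest(i, y[i])
        misses = {c: nearest(i, c) for c in prior if c != y[i]}
        for f in range(n_features):
            weights[f] -= sum(diff(f, X[i], X[j]) for j in hits) / (m * k)
            for c, indices in misses.items():
                scale = prior[c] / (1.0 - prior[y[i]])
                miss_sum = sum(diff(f, X[i], X[j]) for j in indices)
                weights[f] += scale * miss_sum / (m * k)
    return weights


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1],
     [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = ["a", "a", "a", "b", "b", "b"]
W = relieff(X, y, k=2)
```

On this toy set the discriminative feature receives a positive weight and the uninformative one a negative weight, which matches the interpretation of the weights given in the text.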
From the process described above, a large weight means that the feature is important for discriminating between samples, and a small weight means that it is less important. Therefore, all features can be ranked in terms of their weights. To select channels, the top-N features were chosen according to this ranking, and the channels containing these features were then selected. Although reducing the number of features usually reduced the number of channels involved, the actual effect was not obvious [12].
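A small sketch shows why keeping only the top-N features may remove few channels. It assumes, purely for illustration, that features are stored channel by channel (the features of channel 0 first, then channel 1, and so on); the function name and the toy weights are hypothetical.

```python
def channels_from_top_features(feature_weights, top_n, per_channel=4):
    """Return the sorted channel indices that contain the top_n features."""
    order = sorted(
        range(len(feature_weights)),
        key=lambda f: feature_weights[f],
        reverse=True,
    )
    # Integer division maps a feature index to its channel index.
    return sorted({f // per_channel for f in order[:top_n]})


# Toy case: 3 channels x 2 features, one strong feature in each channel.
toy_weights = [0.9, 0.1, 0.8, 0.2, 0.7, 0.0]
selected = channels_from_top_features(toy_weights, top_n=3, per_channel=2)
```

Here half of the features are discarded, yet all three channels are still needed because each one holds a highly ranked feature.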
However, the ultimate goal of sensor channel selection is to achieve the best accuracy for classification by using the least number of sensors. From this perspective, a natural strategy is evaluating the importance of channels directly for classification, treating a channel rather than a feature as a unit. If we accept that the weight of a feature obtained from ReliefF reflects its capability to discriminate different classes, we can use the weights of all features belonging to a channel to estimate the channel’s contribution to class discriminability.
We defined the mean of the weights of all features belonging to a channel as this channel’s weight. In our work, 32 channels were used and each channel contained four features. The weight of channel $T$ is computed as
$$W(T) = \frac{1}{N} \sum_{i=1}^{N} W(f_i) \qquad (5)$$
where $W(f_i)$ is the weight of the $i$-th feature ($f_i$) belonging to channel $T$, and $N$ is the number of features of channel $T$. Then, channels can be ranked according to their weights, and the best channels for the classification task can be easily selected. This method was denoted as mean-ReliefF-channel-selection (MRCS).
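The channel weighting and ranking step can be sketched as follows. The channel-major feature ordering and the function names are assumptions made for illustration.

```python
def channel_weights(feature_weights, n_channels=32, per_channel=4):
    """Average the feature weights of each channel (channel-major order)."""
    if len(feature_weights) != n_channels * per_channel:
        raise ValueError("unexpected number of feature weights")
    return [
        sum(feature_weights[c * per_channel:(c + 1) * per_channel]) / per_channel
        for c in range(n_channels)
    ]


def rank_channels(weights):
    """Channel indices sorted from the largest weight to the smallest."""
    return sorted(range(len(weights)), key=lambda c: weights[c], reverse=True)


# Toy case: 3 channels x 2 features.
toy_feature_weights = [0.1, 0.3, 0.5, 0.7, -0.2, 0.0]
toy_channel_weights = channel_weights(
    toy_feature_weights, n_channels=3, per_channel=2
)
toy_ranking = rank_channels(toy_channel_weights)
```

Selecting the best n channels then amounts to taking the first n entries of the ranking.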
Now, we can select channels, independent of a classifier, for a further classification task. This selection may be effective for different kinds of classifiers, but it is not guaranteed to be optimal for a specific classifier. As mentioned in Section 1, a search algorithm can find the optimal channel set for a specific classifier, but it is computationally expensive. We propose a strategy that iteratively adjusts the weights of channels according to their contribution to the classification accuracy of a specific classifier, as shown in Figure 1. At the beginning, the weight of every channel is initialized to the mean of the ReliefF weights of all features belonging to this channel, denoted $W_0(T)$. Then, on each subject’s channel-selection dataset, the classification accuracy for a varying number of channels is obtained with the specific classifier by adding the channels one by one in order of their weights; the result is denoted $S_0(n)$ (the average classification accuracy when the top-$n$ channels are used, $n$ = 1 to 32). The contribution of the $n$-th ranked channel in the $i$-th iteration can then be computed as
$$C_i(n) = S_i(n) - S_i(n-1) \qquad (6)$$
This contribution value can be positive or negative. The weight of channel $T$ is then updated as
$$W_{i+1}(T) = W_i(T) + C_i(n_T) \qquad (7)$$
where $n_T$ is the rank of channel $T$ among all 32 channels. That is, the weight of a channel increases when its contribution is positive and decreases otherwise. This process is repeated until the absolute value of the largest negative contribution among all channels is less than ε or 50 iterations have been performed. In this paper, ε = 0.01. The channels are then evaluated and selected according to the final weights $W$. This method, which aims to further optimize the channel selection of MRCS for a specific classifier, is denoted X-MRCS (X can be replaced with the classifier name, e.g., SVM-MRCS).
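The iterative adjustment can be sketched as below. Several details are assumptions made only for illustration: the contribution of a channel is taken as the change in accuracy when it is added to the channels ranked above it, the update simply adds that contribution to the channel's weight, the accuracy with no channels (`baseline`) is supplied by the caller, and the stopping test is evaluated before each update. The callback `accuracy_fn` is a hypothetical stand-in for training and evaluating the chosen classifier on a channel subset.

```python
def rank_channels(weights):
    """Channel indices sorted from the largest weight to the smallest."""
    return sorted(range(len(weights)), key=lambda c: weights[c], reverse=True)


def x_mrcs(initial_weights, accuracy_fn, baseline, eps=0.01, max_iter=50):
    """Iteratively adjust channel weights using classifier accuracy."""
    weights = list(initial_weights)
    for _ in range(max_iter):
        ranking = rank_channels(weights)
        # scores[n] is the accuracy obtained with the top-n channels.
        scores = [baseline] + [
            accuracy_fn(ranking[:n]) for n in range(1, len(ranking) + 1)
        ]
        contributions = [scores[n] - scores[n - 1] for n in range(1, len(scores))]
        worst = min(contributions)
        # Stop once no channel has a negative contribution of at least eps.
        if worst >= 0 or abs(worst) < eps:
            break
        for position, channel in enumerate(ranking):
            weights[channel] += contributions[position]
    return weights, rank_channels(weights)


# Toy accuracy model: each channel adds a fixed amount to a chance level
# of 0.25; channel 2 is harmful but starts with the largest weight.
utility = [0.05, 0.30, -0.10, 0.20]


def toy_accuracy(channels):
    return 0.25 + sum(utility[c] for c in channels)


final_weights, final_ranking = x_mrcs(
    [0.1, 0.2, 0.4, 0.3], toy_accuracy, baseline=0.25
)
```

In this additive toy model the harmful channel keeps a negative contribution of about 0.10, so the loop runs up to the iteration cap and the channel ends up ranked last. With a real classifier, contributions depend on which channels are already included, so the ranking and contributions can change between iterations.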