1. Introduction
Emotions encompass diverse types of subjective mental experiences and play an important role in daily life, influencing human decision-making, planning, and other psychological states [1]. With the rapid development of computer technology, artificial intelligence, wearable technology, and information fusion technology, machines and computers have gained the ability to recognize, analyze, and understand emotions [2]. Emotion recognition has a broad range of applications, such as human–computer interaction [3], epilepsy detection [4], military and aerospace applications [5], and depression detection [6].
Emotions are usually expressed through physiological responses and behaviors. Compared with behaviors such as facial expressions, vocalizations, and body movements, physiological signals are regarded as more precise reflections of a subject’s emotions because they are beyond the subject’s direct voluntary control [7]. Among the various physiological signals, electroencephalogram (EEG) measurements are noninvasive, cost-effective, easy to perform, and have a high temporal resolution [8]. Therefore, a growing number of researchers are using EEG signals for emotion recognition.
However, traditional EEG measurement devices are primarily designed for medical and research applications. These devices are generally bulky, cumbersome, and operationally demanding, which limits their applicability for daily use [9]. To address these shortcomings, many wearable EEG devices based on dry electrodes are being adopted in emotion recognition research. For example, Diaz et al. [10] acquired eight-channel EEG signals and motor data to analyze the emotional state of subjects playing a whack-a-mole game. Ma et al. [11] presented a portable EEG signal acquisition and analysis system built around a 10-channel dry-electrode device. The system is supported by a channel selection network based on Squeeze-and-Excitation (SE) attention and multi-scale convolution to enhance classification performance. These studies demonstrate that, as wearable technology advances, the potential of emotion recognition in natural settings using dry-electrode wearable EEG devices is increasingly being realized.
Designing effective discriminant models, which may include deep learning and machine learning techniques, is crucial for EEG-based emotion recognition [12]. Deep learning models can learn from manually extracted features, and they can also extract features automatically from raw EEG signals while training the classifier at the same time. In emotion classification tasks, deep learning models can effectively mine deep feature representations of EEG data and are widely used because of their excellent performance [13].
Deep learning models, such as convolutional neural networks (CNNs), are capable of learning localized feature representations and have achieved promising results in EEG classification tasks. For example, Song et al. [14] developed a dynamic graph convolutional neural network (DGCNN) that effectively extracts the cross-channel interactions in EEG signals. Cheng et al. [15] proposed MSDCGTNet, which extracts the spectral, spatial, and temporal information of EEG signals through CNNs and a gated transformer encoder. Jin et al. [16] constructed a network called the pyramidal graph convolutional network (PGCN), which integrates features at local, mesoscopic, and global levels while analyzing structural and functional connectivity at each level. Jiménez et al. [17] presented IL2FS, which incorporates weight alignment, margin ranking loss, and triplet loss to maintain inter-class discriminability and feature space alignment for known classes. Han et al. [18] presented a model combining multi-scale convolution with a TimesNet network. This model employs various convolutional kernels to capture dynamic spatio-temporal relationships, after which TimesNet is used to build 2D sequences and learn intricate temporal features. Tang et al. [19] designed an emotion recognition framework based on an Efficient Capsule Network with Convolutional Attention (ECNCA). They first concatenated and fused features from specific frequency bands of EEG signals, then used ECNCA to augment the input data with CNN and attention mechanisms, and finally applied Efficient Capsule to categorize the emotions.
Nevertheless, these methods rely on a large number of EEG channels, which reduces wearing comfort and raises hardware costs. In addition, multi-channel EEG signals complicate the processing algorithms, constraining their use in real-time scenarios, particularly in environments requiring low-latency responses. Thus, it is necessary to develop emotion recognition methods based on sparse-channel EEG for applications in natural environments. As illustrated in Figure 1, reducing the number of EEG channels significantly facilitates signal acquisition with wearable devices, improving comfort and practicality in real-world applications. Based on the left–right brain asymmetry theory, the sparse channels employed in this study are selected symmetrically from the left and right hemispheres (i.e., FP1, FP2, F3, and F4). All the electrodes corresponding to these channels are located over the prefrontal cortex (in accordance with the international 10–20 system), are unobstructed by hair, and have been shown to be closely related to emotions.
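The channel reduction described above can be sketched as a simple lookup by electrode label. This is only an illustration: the function name `select_channels`, the list-of-lists recording layout (one list of samples per channel), and the example montage are assumptions made here and are not taken from the paper or from any dataset's actual channel order.

```python
# Sparse prefrontal channels named in the text (international 10-20 labels).
SPARSE_CHANNELS = ("FP1", "FP2", "F3", "F4")


def select_channels(recording, channel_names, wanted=SPARSE_CHANNELS):
    """Return the rows of `recording` that correspond to `wanted` labels.

    `recording` is assumed to be a list with one list of samples per channel,
    ordered like `channel_names`. Label matching ignores case, because
    datasets write "Fp1" and "FP1" interchangeably.
    """
    if len(recording) != len(channel_names):
        raise ValueError("recording and channel_names differ in length")
    index = {name.upper(): i for i, name in enumerate(channel_names)}
    missing = [name for name in wanted if name.upper() not in index]
    if missing:
        raise ValueError(f"channels not found in recording: {missing}")
    # Keep the order of `wanted`, so left/right pairs stay adjacent.
    return [recording[index[name.upper()]] for name in wanted]


if __name__ == "__main__":
    # Hypothetical five-channel montage with two samples per channel.
    names = ["Fp1", "AF3", "F3", "Fp2", "F4"]
    data = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
    sparse = select_channels(data, names)
    print(len(sparse), "channels kept")
```

Returning the channels in a fixed order (FP1, FP2, F3, F4) keeps the left and right electrodes of each pair next to each other, which is convenient if asymmetry features are computed later.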
Cross-subject emotion recognition is emerging as a trend because of its broader applicability [20]. Shi et al. [21] proposed an EEG cognitive recognition method based on a spiking neural network, in which a sample-based adaptive thresholding strategy was introduced to improve generalization ability. Li et al. [22] put forward a cross-subject emotion training method based on a spatial–temporal graph attention network, which captures time–frequency information and intrinsic relationships between different EEG channels to achieve higher classification accuracy. Li et al. [23] presented a fusion model of a multi-scale residual network (MSRN) and a meta-transfer learning (MTL) strategy. The MTL strategy combines the advantages of meta-learning and transfer learning to reduce individual differences among subjects. Shen et al. [24] constructed a network called CLISA based on contrastive learning, with a training phase incorporating a specially formulated loss function. The network combines temporal and spatial convolution and achieved 47% accuracy on the THU-EP dataset. Ding et al. [25] developed a transformer method called emotion transformer (EmT), which consists of a residual multi-view pyramid GCN module and a temporal contextual transformer. The method attained an accuracy of 59.5% on the THU-EP dataset. In addition, domain adaptation methods in transfer learning can effectively reduce inter-subject differences in neural representations, enabling robust decoding of cross-subject emotional states. For example, Quan et al. [26] introduced a feature extraction method called MR-VAE, as well as a multi-source domain transfer learning model for selecting the optimal transferable samples from both global and subdomain distributions. Ni et al. [27] proposed CTDDSR, which combines transfer learning, dictionary learning, and linear discriminant analysis techniques. The model enables different domains to use a shared dictionary, extract details from the EEG signals of the current domains, and transfer the knowledge to a new domain, which effectively reduces the demand for labeled data from target subjects and alleviates the differences between the source and target domains.
Existing domain adaptation methods typically require explicit source–target alignment and simultaneous access to both domains, which is impractical for wearable EEG applications. Moreover, such methods are sensitive to severe inter-subject variability and sparse-channel constraints. Furthermore, incremental learning approaches have mainly focused on class-incremental scenarios rather than on the subject differences that arise in cross-subject emotion recognition. In addition, most existing incremental learning frameworks are not designed to work with feature extraction architectures tailored to sparse-channel EEG signals.
To address these limitations, we propose a cross-subject emotion recognition model (TSCL-LwF) based on sparse-channel EEG, which combines a multi-scale convolutional network (TSCL) with an incremental learning strategy based on learning without forgetting (LwF). Specifically, in the TSCL model, a multi-scale architecture is used to capture the temporal dynamics and inter-channel spatial correlations of the local prefrontal region (all sparse-channel readings are confined to the prefrontal area), while its separable convolutional layers model inter-channel dependencies by fusing multi-level features extracted from the sparse-channel EEG signals. The incremental learning strategy with LwF introduces a limited set of labeled target domain data so that the TSCL-LwF model can rapidly adapt to the target domain data distribution. Meanwhile, LwF uses a knowledge distillation loss to retain prior knowledge learned from source subjects and thereby reduce the differences between subjects, enhancing the generalization ability and recognition accuracy of cross-subject emotion recognition.
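To make the role of the distillation term concrete, the following is a minimal sketch of the generic LwF objective: a cross-entropy term on the labeled target-subject sample plus a weighted distillation term that keeps the adapted model's softened outputs close to those of the frozen source-trained model. It is not the exact loss used by TSCL-LwF, which is defined later in the paper; the weight `lam`, the `temperature` value, and all function names are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard cross-entropy for one sample with an integer class label."""
    return -math.log(softmax(logits)[label])


def distillation_loss(old_logits, new_logits, temperature=2.0):
    """Cross-entropy between the softened outputs of the old and new models.

    `old_logits` come from the frozen model trained on source subjects;
    `new_logits` come from the model being adapted to the target subject.
    """
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    return -sum(po * math.log(pn) for po, pn in zip(p_old, p_new))


def lwf_loss(new_logits, old_logits, label, lam=1.0, temperature=2.0):
    """Target-subject classification loss plus weighted distillation loss.

    `lam` (assumed hyperparameter) trades adaptation against retention.
    """
    adapt = cross_entropy(new_logits, label)
    retain = distillation_loss(old_logits, new_logits, temperature)
    return adapt + lam * retain


if __name__ == "__main__":
    # Hypothetical two-class (e.g., low/high valence) logits for one sample.
    old = [1.2, -0.4]   # frozen source-trained model
    new = [0.3, 0.9]    # model being adapted to the target subject
    print(lwf_loss(new, old, label=1, lam=0.5))
```

With `lam=0` the objective reduces to ordinary fine-tuning on the target data; larger values of `lam` penalize drift away from the source-trained model's predictions more strongly.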
The contributions of this paper are as follows:
1. A new multi-scale convolutional model called TSCL is proposed. It uses a multi-scale architecture and separable convolutional layers for spatio-temporal feature extraction and interaction, which makes it especially effective at capturing domain-invariant characteristics from sparse-channel EEG signals.
2. An incremental learning strategy with LwF is introduced. It uses a limited set of labeled target domain data to enable TSCL-LwF to rapidly adapt to the target domain, while leveraging a knowledge distillation loss to reduce the differences between subjects. This strategy enhances the performance of cross-subject emotion recognition.
3. Extensive experiments on the DEAP and EPPVR datasets verify the effectiveness of TSCL-LwF, which outperforms other methods on valence, arousal, and valence–arousal classification with sparse-channel EEG.
The rest of this article is organized as follows: Section 2 introduces the proposed TSCL-LwF. Section 3 presents the datasets and experimental details. Section 4 presents the experimental results and analysis. A discussion of the results is given in Section 5, and Section 6 concludes the paper.