Image-Evoked Emotion Recognition for Hearing-Impaired Subjects with EEG Signals

In recent years, there has been growing interest in emotion recognition through electroencephalogram (EEG) signals. One group of particular interest is individuals with hearing impairments, who may have a bias towards certain types of information when communicating with those in their environment. To address this, our study collected EEG signals from both hearing-impaired and non-hearing-impaired subjects while they viewed pictures of emotional faces for emotion recognition. Four kinds of feature matrices, namely the symmetry difference and symmetry quotient based on the original signal and on differential entropy (DE), were constructed to extract spatial domain information. A multi-axis self-attention classification model was proposed, which consists of local attention and global attention and combines the attention model with convolution through a novel architectural element for feature classification. Three-classification (positive, neutral, negative) and five-classification (happy, neutral, sad, angry, fearful) emotion recognition tasks were carried out. The experimental results show that the proposed method is superior to the original feature method, and that multi-feature fusion achieved a good effect in both hearing-impaired and non-hearing-impaired subjects. The average classification accuracy was 70.2% (three-classification) and 50.15% (five-classification) for hearing-impaired subjects, and 72.05% (three-classification) and 51.53% (five-classification) for non-hearing-impaired subjects. In addition, by exploring the brain topography of different emotions, we found that the discriminative brain regions of the hearing-impaired subjects were also distributed in the parietal lobe, unlike those of the non-hearing-impaired subjects.


Introduction
Emotion is the attitude and experience generated after human beings compare objective things with their needs. It reflects people's current physiological and psychological state, which plays a crucial role in people's cognition, communication, and decision making [1]. Researchers believe that any emotion produced by human beings must be accompanied by some physical changes, such as facial expression, muscle contraction and relaxation, and visceral activities. How to understand and express emotions is affected by a human's ability to perceive the outside world and express themselves.
With the development of emotional computing and artificial intelligence, the study of emotion recognition via electroencephalogram (EEG) has received more attention. In recent years, EEG-based emotion recognition has mainly focused on healthy people and achieved good results [2]. However, few studies have examined groups such as hearing-impaired people [3,4]. Some researchers believe that due to hearing loss, hearing-impaired people find it difficult to receive information from the outside world as fully and accurately as hearing people [5,6]. As a result, they may have cognitive bias which can cause them to interpret interactions excessively negatively, producing interpersonal cognitive bias [7]. Therefore, it is of great importance to study the facial emotion recognition ability of hearing-impaired people to examine and, if desired by the individuals concerned, aid their social adaptability. To aid the facial recognition ability of hearing-impaired subjects where desired, it is necessary to understand the differences in their recognition of the emotions of different faces. Therefore, in this paper, the emotions of non-hearing-impaired subjects and hearing-impaired subjects were induced by pictures of Chinese faces showing certain emotions, and their EEG signals were collected for emotion recognition to analyze the cooperative working mechanisms of the brain regions of hearing-impaired subjects when recognizing facial emotions.
Feature extraction is particularly important in EEG emotion recognition [7]. EEG signals comprise high-dimensional temporal data with a low signal-to-noise ratio [8,9]. Therefore, effective feature extraction yields EEG features rich in information, which improves the ability to distinguish types of emotion in a feature space of limited dimension [10]. Time domain features were the first to be used because of their strong intuitiveness and good recognition effect for specific waveform signals. Fourati et al. [11] proposed the echo state network (ESN), which used recurrent layers to project original EEG signals into a high-dimensional state space, and achieved satisfactory results. Frequency domain features are based on the Fourier transform: the distribution of signal power along frequency is obtained by transforming the signal from the time domain to the frequency domain. Such features, including power spectral density (PSD) and differential entropy (DE), have been proven to be efficient in EEG emotion recognition [12]. Duan et al. [13] applied DE features to EEG signals in emotion recognition and found that DE features had a high recognition rate and stability for positive and negative emotion recognition. Li et al. [14] proposed a spatial feature extraction method to capture spatial information between different channels using a hierarchical convolutional neural network based on a two-dimensional EEG graph. Although existing emotion recognition methods have achieved high accuracy, most of them only consider a single feature or a combination of two features, ignoring the complementarity between different features.
EEG emotion recognition is essentially a pattern recognition problem that aims to extract highly discriminative emotion features from EEG signals and improve the accuracy of emotion recognition. Early emotion recognition models were mainly shallow machine learning models such as the support vector machine (SVM) [15] and the k-nearest neighbor (KNN) algorithm [16]. Among them, the SVM algorithm is widely used because of its unique advantages in solving small-sample, high-dimension, and nonlinear machine learning tasks. Bhardwaj et al. [17] extracted PSD features and used SVM and the linear discriminant analysis (LDA) algorithm for emotion recognition. The results show that SVM is superior to LDA in emotion classification. In terms of deep learning, Cimtay et al. [18] proposed using a pre-trained CNN to extract features, which eliminates the influence of voltage amplitude fluctuation through data normalization and adds a pooling layer and a fully connected layer to the pre-trained network, thus improving the classification performance of the network. Song et al. [19] further modeled multichannel EEG data into graph data with the basic units of an electroencephalogram, providing a new perspective for analyzing EEG data. They designed the Dynamic Graph Convolutional Neural Network (DGCNN). Based on the learned adjacency matrix, the model can complete the feature propagation between electrodes and learn more discriminative features to improve emotion recognition. Previous studies have only focused on local features affecting the capacity and generalization of the whole model but ignored global features. Therefore, starting from effectively integrating global and local interactions, this paper focuses on balancing model capacity and universality in order to improve model performance.
In this paper, we selected Chinese facial expression pictures from CFAPS as stimuli to induce subjects' emotions. EEG signals induced by five types of emotional images were collected from 20 non-hearing-impaired subjects and 20 hearing-impaired subjects. In addition, the multi-axis self-attention module, composed of a window attention module and a grid attention module, effectively combines local and global features to improve classification performance and reduce computational complexity. Promising classification results are obtained on the datasets of hearing-impaired subjects and non-hearing-impaired subjects, respectively, which supports the validity of the classification model. By drawing a brain topographic map based on DE features, we further compare and analyze the differences in brain region energy in emotion recognition between hearing-impaired subjects and non-hearing-impaired subjects. Compared with non-hearing-impaired subjects, the changes in the emotion recognition of hearing-impaired subjects are not only concentrated in the temporal lobe but also distributed in the parietal lobe. The main contributions of this study are given as follows.
(1) We have constructed an emotional EEG dataset using facial expression picture stimuli, featuring both non-hearing-impaired and hearing-impaired subjects.
(2) To facilitate fusion, we have developed two novel constructs, the subtract symmetric matrix (SSM) and the quotient symmetric matrix (QSM), based on the original signal and the DE feature. These matrices are designed around the electrode positions of the widely recognized international 10-20 system and take into account the nature of electrode pairs in symmetrical positions. SSM quantifies the difference between the feature values of the left and right brain regions by measuring the differences of 27 pairs of symmetric electrodes. Similarly, QSM computes the quotient of the feature values of the left and right brain regions over the same electrode pairs.
(3) Our work puts forth a multi-axis self-attention mechanism for EEG-based emotion recognition in both non-hearing-impaired and hearing-impaired individuals. The mechanism is composed of two modules that run in parallel: a window attention module, which focuses on local features of the EEG signal, and a global attention module, which extracts the global features.
The rest of the paper is organized as follows. In Section 2, we introduce the experiment setup in detail. Section 3 describes the proposed feature extraction and emotion classification method. The experimental results are analyzed in Section 4. Section 5 discusses this work. Finally, we summarize the paper and look forward to future research development in Section 6.

Materials
Current methods of emotion induction mainly stimulate subjects through pictures [20], audio [21], and video [22] to obtain corresponding emotional EEG signals. According to the mirror neuron theory [23], when people observe another person performing a certain activity, their own brain responds as if they were performing the activity themselves. In addition, the mirror neuron system is also involved in recognizing emotions through other people's facial expressions and gestures. Therefore, this experiment is designed to induce corresponding emotions by showing facial expressions of different emotions to subjects. The EEG signals of 20 hearing-impaired college students and 20 non-hearing-impaired college students were collected while they viewed pictures of five different emotional faces under the same experimental materials and environment. Experimental preparation mainly includes recruiting subjects, selecting stimulus materials, and introducing experimental procedures. The Ethics Committee of Tianjin University of Technology has approved this experiment for emotion recognition research (May 2022).

Subjects
In this work, twenty hearing-impaired subjects (7 females and 13 males) with an average age of 22 years were recruited from the School of Audiology and Artificial Sciences of Tianjin University of Technology for an emotion induction experiment. We only conducted subject-dependent experiments, in which the training dataset and test dataset were from the same subject, so that gender did not influence the experimental results. Basic personal information and auditory information were collected before the subjects participated in the experiment. The subjects had normal or corrected-to-normal vision, were right-handed, and had no history of mental illness. All of the subjects had hearing loss in both ears, 4 subjects had congenital disorders, and 18 subjects wore hearing aid devices. Details of the 20 hearing-impaired subjects are shown in Table 1. In addition, we recruited 20 students (14 males and 6 females, aged 18-25 with an average age of 22) from the School of Electrical Engineering and Automation to participate in the experiment. The 20 subjects with unimpaired hearing were right-handed and had no history of mental illness. All 40 participants provided informed consent approved by the Ethics Committee of Tianjin University of Technology. Before the experiment, all subjects were informed about the purpose of the experiment and the harmlessness of the EEG collection equipment. In special circumstances, including but not limited to fatigue, an unsuitable environment, and discomfort caused by the stimulus images, all subjects were permitted to terminate and withdraw from the experiment.

Stimulation Materials Selection
In our previous research, we found that small changes in the facial expression of hearing-impaired subjects while watching video stimuli are particularly indicative of emotional changes. In order to verify whether this conclusion is still valid under different emotional induction methods, this study used facial expression pictures as stimuli, which contain rich information about facial changes.
The stimulus material of the experiment was 240 emotional images of faces selected from the Chinese Facial Affective Picture System [24] (CFAPS). CFAPS, as a localized facial expression image system, contains a total of 870 facial expression images of seven emotional types. A total of 100 college students evaluated the emotional type of each image and gave a score of 1-9 for the emotional intensity expressed, based on the assessed emotional type. To ensure that the emotional images of faces had a good effect on emotional arousal and practicability, we selected 5 types of facial expression pictures, including 36 angry faces, 36 fearful faces, 36 sad faces, 36 neutral faces, and 36 happy faces, as experimental stimulus materials. The emotional face images are illustrated in Figure 1. All images had a resolution of 260 × 300, and we used a 17-inch monitor to display them. With reference to the SEED dataset [25], our dataset can also be tested for three-classification. We assign sadness to negative emotions, happiness to positive emotions, and neutral to neutral emotions. More images of sad, neutral, and happy expressions were used in order to increase the sample size in the three-classification task. No clothing or accessories were included in the images, and the parameters remained the same for all faces in the experiment.

Experimental Paradigm
In this experiment, the NeuSen W 364 EEG acquisition system developed by Neuracle (Changzhou, China) was used to collect the subjects' EEG signals. The device collects EEG signals at a sampling rate of 1 kHz and has 64 electrode channels. The electrodes were placed according to the international 10-20 system, and TP9 and TP10, located over the bilateral mastoids, were used as reference electrodes.
The experiment was conducted in an isolated environment. Figure 2 shows the experimental paradigm. First, subjects were shown a black screen for 5 s to allow them to calm down and adjust their state. A five-second countdown was shown to prepare the subjects, and a cross was shown to remind them to pay attention. An emotional picture of a face was then displayed for five seconds, and the subjects' EEG signals were recorded. A practice test, with the same procedure as the formal experiment, was conducted beforehand, and all subjects could ask questions at any time. The experimenters were prepared to help solve any problems that might arise during the experiment. Subjects were advised to stay as still as possible, to understand the emotions rather than simply mimic the facial expressions, and to try not to blink while the picture was shown, which helps to reduce noise in the EEG signals. The experiment was divided into four rounds; 60 trials were conducted in each round, in groups of 15 trials. There was a rest of 1-2 min after each group and a rest of 5 min between rounds.

Methods
The collected EEG signals need to be preprocessed, and the input of the deep learning model is then obtained by feature extraction. Four feature matrices (SSM(OEF), QSM(OEF), SSM(DE), QSM(DE)) were extracted based on the asymmetric characteristics of the left and right brain regions. The model proposed in this paper first fuses the four feature matrices and then obtains two deep features through the local and global attention modules, respectively. The two deep features are fused again to obtain a feature vector with global and local representativeness, which is sent into the fully connected layer to get the classification output of the model. This section introduces the EEG signal pre-processing method, the EEG emotion feature extraction method, and the construction of the classifier.

Data Pre-Processing
The EEG signals collected in the experiments are inevitably mixed with various interference signals and artifacts, which need to be removed by signal pre-processing to improve the performance of subsequent emotion recognition. We used the EEGLAB toolbox to process the EEG signals. First, the original EEG signal was down-sampled to 200 Hz. Low-frequency drift and high-frequency noise were then removed by a 1-75 Hz band-pass filter, and power frequency interference was eliminated by a 49-51 Hz band-stop (notch) filter. The bilateral mastoids, TP9 and TP10, were used as references, and bad channels were repaired by interpolation. Finally, independent component analysis (ICA) was used to remove artifacts.
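The filtering steps above can be sketched in Python with SciPy. This is only an illustrative sketch, not the EEGLAB pipeline used in the study: the function name `preprocess`, the Butterworth filter orders, zero-phase filtering, and polyphase resampling are all assumptions, and re-referencing, bad-channel interpolation, and ICA are omitted.

```python
import numpy as np
from scipy import signal


def preprocess(eeg, fs_in=1000, fs_out=200):
    """Down-sample and filter EEG of shape (channels, samples).

    Filter orders and the zero-phase design are illustrative assumptions.
    """
    # Down-sample 1000 Hz -> 200 Hz (polyphase resampling includes anti-aliasing).
    x = signal.resample_poly(eeg, up=fs_out, down=fs_in, axis=-1)
    # 1-75 Hz band-pass: removes low-frequency drift and high-frequency noise.
    sos_bp = signal.butter(4, [1.0, 75.0], btype="bandpass", fs=fs_out, output="sos")
    x = signal.sosfiltfilt(sos_bp, x, axis=-1)
    # 49-51 Hz band-stop: suppresses power-line interference.
    sos_bs = signal.butter(2, [49.0, 51.0], btype="bandstop", fs=fs_out, output="sos")
    return signal.sosfiltfilt(sos_bs, x, axis=-1)
```

A 5 s trial recorded at 1 kHz (5000 samples per channel) becomes 1000 samples per channel after this step.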

Feature Extraction
It is necessary to extract features from the pre-processed EEG signals to characterize the emotion-related information in the EEG signals. In this section, four different feature matrices (the subtract symmetric matrix and the quotient symmetric matrix of the original signal and of the DE feature) are constructed according to the electrode positions in the international 10-20 system as the input of the classification model. The construction process of the feature matrices is shown in Figure 3.
In this paper, $(O^T_1, O^T_2, \ldots, O^T_T) \in \mathbb{R}^{E \times T}$ is defined as the EEG sample containing $T$ time points, $E$ (= 62) is the number of electrodes, and $O^T_t \in \mathbb{R}^{E}$ ($t \in \{1, 2, \ldots, T\}$) represents the EEG signal of all electrodes collected at time $t$. Then, $O^T_t$ is converted into a two-dimensional time-domain matrix $M^T_t \in \mathbb{R}^{H \times W}$, namely, the 9 × 9 matrix we construct.
Differential entropy (DE) is a feature extraction method widely used in the field of EEG emotion recognition. It is an extension of Shannon entropy to continuous variables. For an EEG signal that approximately follows a Gaussian distribution within a frequency band, DE is determined by the logarithm of the signal variance in that band. The calculation formula of the EEG DE is as follows:
$$DE = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx = \frac{1}{2} \log\left(2 \pi e \sigma_i^2\right)$$
where $f(x)$ is the probability density function of $N(\mu, \sigma_i^2)$. It can be seen from the formula that, for EEG sequences of the same length, the differential entropy in a frequency band is equivalent to the logarithm of the signal energy in that band.
In order to construct the DE feature matrix, the differential entropy features of five frequency bands (δ, θ, α, β, and γ) are extracted, and $S^S_b \in \mathbb{R}^{E}$ ($b \in \{1, 2, \ldots, B\}$) represents the DE features of all electrodes in band $b$. Then, $S^S_b$ is converted into a two-dimensional DE feature matrix $M^S_b \in \mathbb{R}^{H \times W}$, where $M^S_b$ is the 9 × 9 matrix we constructed; positions of the matrix that do not correspond to an electrode are set to 0, and normalization is carried out.
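As a hedged illustration of the DE feature, the sketch below evaluates the Gaussian closed form $\frac{1}{2}\log(2\pi e\sigma^2)$ on band-filtered signals. The band limits in `BANDS`, the filter order, and the function names are assumptions made for this example; the text names five bands but this section does not state their frequency limits.

```python
import numpy as np
from scipy import signal

# Band edges in Hz are illustrative assumptions, not values taken from the paper.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}


def differential_entropy(x, axis=-1):
    """Closed-form DE of a signal assumed Gaussian: 0.5 * log(2*pi*e*variance)."""
    var = np.var(x, axis=axis)
    return 0.5 * np.log(2 * np.pi * np.e * var)


def band_de(eeg, fs=200):
    """DE per band for EEG of shape (channels, samples); returns (n_bands, channels)."""
    feats = []
    for lo, hi in BANDS.values():
        sos = signal.butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        filtered = signal.sosfiltfilt(sos, eeg, axis=-1)
        feats.append(differential_entropy(filtered))
    return np.stack(feats)
```

For a 62-channel segment, `band_de` returns a (5, 62) array, one DE vector per band, which would then be placed on the 9 × 9 electrode grid.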
$M^T_t$ and $M^S_b$ represent the 2-D matrices of the original signal and of the DE features after the electrodes are mapped onto a 9 × 9 grid. Symmetry subtraction refers to the difference between the feature values of symmetrical electrodes in the left and right hemispheres. In the 10-20 system, there are 27 pairs of symmetrical electrodes, as shown in Figure 4. Left-hemisphere electrodes are in the blue frame, right-hemisphere electrodes are in the red frame, and the midline electrodes are removed during data processing.
We find the 27 pairs of symmetric electrodes and compute the difference for each pair, where $d^i_t$ represents the electrode pair difference corresponding to the sampling point $l \in [1, f]$, $d^i_b$ represents the electrode pair difference in frequency band $b \in [1, B]$, and $i$ represents the electrode serial number after removing the channels in the middle position. The subtract symmetric matrices (SSM) based on the original signal and on DE are constructed from these differences.
The symmetry quotient feature is the quotient of the feature values of symmetric electrodes in the left and right hemispheres. As with the SSM, it is necessary to find the symmetric electrodes and eliminate the electrodes in the middle position. Here, $q^i_t$ represents the electrode pair quotient corresponding to the sampling point $l \in [1, f]$, $q^i_b$ represents the electrode pair quotient in frequency band $b \in [1, B]$, and $i$ represents the electrode serial number after removing the channels in the middle position. The quotient symmetric matrices (QSM) based on the original signal and on DE are constructed from these quotients.
In order to avoid the influence of excessively large numerical differences on subsequent processing, we normalized each two-dimensional matrix constructed.
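The pairwise difference and quotient can be sketched as follows. Several details are assumptions made only for illustration: the left-minus-right and left-over-right direction, the guard against near-zero denominators, the min-max form of the normalization, and the hypothetical index arrays `left_idx` and `right_idx`, which would hold the 27 left and right electrode indices. How the resulting values are arranged into the final matrices is not reproduced here.

```python
import numpy as np


def pair_features(x, left_idx, right_idx, eps=1e-8):
    """Difference and quotient of symmetric electrode pairs.

    x: array (..., E) of per-electrode feature values (raw samples or DE).
    left_idx / right_idx: hypothetical index arrays of matching electrodes.
    """
    left = x[..., left_idx]
    right = x[..., right_idx]
    d = left - right  # assumed direction: left minus right
    # Guard against division by (near) zero; this guard is an assumption.
    safe_right = np.where(np.abs(right) < eps, eps, right)
    q = left / safe_right
    return d, q


def min_max_normalize(m):
    """Scale values to [0, 1]; min-max scaling is an assumed normalization."""
    span = m.max() - m.min()
    if span == 0:
        return np.zeros_like(m, dtype=float)
    return (m - m.min()) / span
```

With DE features of shape (5, 62) and 27 index pairs, both outputs have shape (5, 27), one value per band and electrode pair.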

Classification Network Construction
In the previous section, we introduced the extraction process of the feature matrices; in this section, we focus on the proposed model. The proposed model takes the four feature matrices as input. First, the feature fusion network produces a fused feature matrix of dimension (200, 64, 64). Then, the fused feature matrix is fed into the multi-axis attention module, which is composed of the global attention module and the local attention module. Finally, the outputs of the global attention module and the local attention module are fused again and fed into the classification network to get the classification results of the model. The model is introduced in the following two parts: the feature fusion network and the classification model.

Feature Fusion Network
Different features contain different emotional information. In order to exploit the complementarity of multiple features, the four feature matrices are sent into the feature fusion network. The fusion network mainly includes a 1 × 1 convolution layer, a normalization layer, and a ReLU activation function. The 1 × 1 convolution can unify the channel dimensions without changing the spatial dimensions of the feature matrix, and the activation function adds nonlinearity, addressing the limited capacity of a purely linear model and alleviating gradient problems during training. In addition, in order to capture more details of the feature matrix and optimize the classification effect, we applied cubic spline interpolation to the matrix [26]. The matrix dimension based on the original signal is (200, 64, 64), and the matrix dimension based on DE is (5, 64, 64). By down-sampling the former and up-sampling the latter, the dimensions of the four feature matrices are unified to (50, 64, 64).
Finally, the four processed feature matrices were concatenated to obtain the fused feature matrix (200 × 64 × 64), where 200 is the first dimension obtained by joining the four 50-channel matrices (4 × 50), and 64 × 64 is the size of the feature map after interpolation. The fused feature matrix was used as the input of the classification network.
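The shape bookkeeping of the fusion step can be illustrated with a short sketch. It covers only the resizing and concatenation: the learned 1 × 1 convolution, normalization, and ReLU are omitted, and using `scipy.ndimage.zoom` with a cubic spline along all three axes is an assumption, not the authors' implementation. The names `resize` and `fuse` are hypothetical.

```python
import numpy as np
from scipy.ndimage import zoom


def resize(m, out_shape):
    """Resample a 3-D array to out_shape with cubic spline interpolation (order=3)."""
    factors = [o / s for o, s in zip(out_shape, m.shape)]
    return zoom(m, factors, order=3)


def fuse(mats, target=(50, 64, 64)):
    """Bring every feature matrix to `target`, then concatenate along axis 0."""
    return np.concatenate([resize(m, target) for m in mats], axis=0)


# Example shapes: two matrices from the original signal (200 samples each)
# and two from DE (5 bands each), all on the 9 x 9 electrode grid.
rng = np.random.default_rng(0)
mats = [rng.normal(size=(200, 9, 9)), rng.normal(size=(200, 9, 9)),
        rng.normal(size=(5, 9, 9)), rng.normal(size=(5, 9, 9))]
fused = fuse(mats)  # shape (200, 64, 64) = 4 matrices x 50 channels each
```

The first axis of the result is 4 × 50 = 200, matching the fused dimension stated above.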

Classification Network
To combine local and global features while reducing the quadratic complexity of global self-attention [27], we introduced a new multi-axis self-attention module. By simply decomposing the spatial axes, full-size attention is decomposed into two sparse forms: local and global. Local and global spatial interactions can be carried out within a single block, effectively combining local and global features. MBConv was first proposed by Howard [28]. It is an inverted linear-bottleneck convolution layer, which consists of depth-wise separable convolution, a squeeze-excitation (SE) layer, and ordinary convolution. Depth-wise separable convolution can reduce model parameters, thus improving model efficiency [29]. At the same time, it can be regarded as conditional position coding, so the model does not require an explicit position coding layer; the kernel size is 3 × 3. General convolution is applied to fully extract the feature information in each input channel and to complete the complementary information extraction between multiple channels. Figure 5 shows the network structure. We describe the composition of each module in detail next.
Figure 5. Structure of the multi-axis self-attention model. The proposed model takes the four feature matrices as input. First, the feature fusion network generates the fused feature matrix with a dimension of (200, 64, 64). The multi-axis attention module then receives the fused feature matrix as input. After the three-layer attention module, the dimension of the feature matrix becomes (4, 4). The output feature vector obtained by fusing the outputs of the local attention module and the global attention module is used for classification.
At first, the network has two convolutional layers to capture the feature information adequately; the features then enter the local self-attention module and the global self-attention module in parallel. Finally, the two outputs are fused and classified after a fully connected layer. Local attention is realized by window attention. The input feature matrix $X \in \mathbb{R}^{H_1 \times W_1 \times C}$ is reshaped into a tensor of shape $(H_1/P \times P, W_1/P \times P, C)$ to represent the division into non-overlapping windows, and then into $(H_1/P \times W_1/P, P \times P, C)$, where the size of each window is $P \times P$ and there are $H_1 W_1 / P^2$ windows in total. Finally, self-attention is applied within each window, and the original shape is restored after the calculation.
As shown in Figure 6, we set $H_1$ and $W_1$ to 8 and $P$ to 4 for illustration, with the same color representing tokens that are mixed in space by the self-attention operation.
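The window partition described above amounts to a reshape and an axis permutation. The following NumPy sketch shows the partition and its inverse for the illustrative setting $H_1 = W_1 = 8$, $P = 4$; the function names are hypothetical, and the self-attention computation itself is left out.

```python
import numpy as np


def window_partition(x, p):
    """(H, W, C) -> (H/P * W/P, P*P, C): one row of tokens per non-overlapping window."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/P, W/P, P, P, C)
    return x.reshape((h // p) * (w // p), p * p, c)


def window_reverse(windows, p, h, w):
    """Inverse of window_partition: restore the (H, W, C) layout."""
    c = windows.shape[-1]
    x = windows.reshape(h // p, w // p, p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/P, P, W/P, P, C)
    return x.reshape(h, w, c)


x = np.arange(8 * 8 * 3).reshape(8, 8, 3)
windows = window_partition(x, p=4)  # 8 * 8 / 4**2 = 4 windows of 16 tokens each
```

Self-attention would then be applied along the second axis (the 16 tokens of each window) before `window_reverse` restores the original shape.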
As shown in Figure 6, we set the H and W to 8, and P to 4 to facilitate the process, with the same color representing mixing in space through the self-attention operation.  For global attention, inspired by window attention, we cite a simple and effective way to obtain sparse global attention-grid attention. Instead of using a fixed window size to divide the feature matrix graph, we used a G × G fixed uniform grid of A to transform the tensor grid into shape (G × G , H 2 /G × W 2 /G, C), to obtain a window H 2 /G × W 2 /G with an adaptive size. Finally, we use self-attention calculation on G × G, thus indirectly realizing the global interaction. In this way, no matter how the height and width of the input feature matrix graph change, our final feature graph will only divide the specified windows in space, which will reduce the amount of computation. The original shape was restored after calculation.
The calculation process is shown in Figure 7. As with window attention, $H_2$ and $W_2$ in the figure are set to 8, and the hyperparameter $G$ is set to 4. Using the same window and grid size ($P = G = 4$) balances the computation between the local and global operations, both of which have only linear complexity with respect to spatial size and sequence length. We place an inverted mobile bottleneck convolution in front of each attention mechanism to further improve the generalization and capability of the network model.
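The grid partition can be sketched in the same way. This is an illustrative NumPy sketch under stated assumptions, not the authors' code: `grid_partition` and `grid_reverse` are hypothetical names, and the sketch places the $G \times G$ grid axis second (shape $(H_2 W_2/G^2,\ G \times G,\ C)$) so that attention would run over the second-to-last axis; this is only a layout choice. The input holds position indices so that the grouping is easy to inspect.

```python
import numpy as np

def grid_partition(x, g):
    """Group an (H, W, C) array by a fixed g x g uniform grid.

    Returns shape (H/g * W/g, g*g, C). Each group collects g*g positions
    spaced H/g rows and W/g columns apart, so it spans the whole map.
    """
    h, w, c = x.shape
    assert h % g == 0 and w % g == 0, "H and W must be divisible by g"
    x = x.reshape(g, h // g, g, w // g, c)
    x = x.transpose(1, 3, 0, 2, 4)            # (H/g, W/g, g, g, C)
    return x.reshape((h // g) * (w // g), g * g, c)

def grid_reverse(grids, g, h, w):
    """Inverse of grid_partition: restore the (H, W, C) layout."""
    c = grids.shape[-1]
    x = grids.reshape(h // g, w // g, g, g, c)
    x = x.transpose(2, 0, 3, 1, 4)            # (g, H/g, g, W/g, C)
    return x.reshape(h, w, c)

# Figure 7 setting: H2 = W2 = 8, G = 4. Each entry stores its flat position.
x = np.arange(64).reshape(8, 8, 1)
grids = grid_partition(x, g=4)                # shape (4, 16, 1)
first_group = grids[0, :, 0]                  # positions taken with stride 2
restored = grid_reverse(grids, g=4, h=8, w=8)
```

With this grouping, each of the $H_2 W_2 / G^2$ groups attends over $G^2$ positions, so the attention cost scales as $(H_2 W_2 / G^2) \cdot (G^2)^2 = H_2 W_2 G^2$, which is linear in $H_2 W_2$ for a fixed $G$. The same count with $P$ in place of $G$ applies to window attention.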

Results
In this section, we conducted subject-dependent experiments to verify the emotion classification performance of the model on the EEG datasets of hearing-impaired and non-hearing-impaired subjects. The training set and test set were divided in a ratio of 7:3. The experiments were run on an NVIDIA GeForce RTX 3050 Ti Laptop GPU. The hyperparameter settings of the classification model are shown in Table 2. In the three-classification task, happy was treated as the positive emotion, neutral as the neutral emotion, and sad as the negative emotion. Table 3 shows the emotion classification results for the hearing-impaired subjects based on the multi-axis self-attention network model. Subject 9 achieved the highest accuracy in both three-classification (74.78%) and five-classification (53.76%), while subject 1 had the lowest, with accuracies of 67.32% and 47.25%. The average accuracy of the proposed method reached 70.72% in three-classification and 50.15% in five-classification.
Table 4 lists the corresponding results for the 20 non-hearing-impaired subjects. The network model also performed well here, with an average accuracy of 72.05% for three-classification and 51.53% for five-classification. Subject 11 achieved the highest accuracy in three-classification (75.47%), while subject 10 had the lowest (69.77%). Subject 14 achieved the highest accuracy in five-classification (53.88%), while subject 5 had the lowest (48.79%). Emotion classification is therefore better for non-hearing-impaired subjects than for hearing-impaired subjects. It can be speculated that this result is related to the difficulty hearing-impaired subjects have with facial emotion recognition under the influence of physiological and social environment factors, which is consistent with the conclusions of previous researchers [30,31].
To study the classification performance for each emotion based on the multi-axis self-attention network model, we used the confusion matrix. Each row of the confusion matrix represents the real category of the data, and each column represents the predicted category.
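As a minimal sketch of this row/column convention, the following NumPy snippet builds a confusion matrix and normalizes each row so that the entries read as per-class rates. The labels are made-up toy values rather than data from this study, and the helper names `confusion_matrix` and `row_normalize` are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are real categories, columns are predicted ones."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)        # accumulate one count per sample
    return cm

def row_normalize(cm):
    """Divide each row by its total so entries become per-class fractions."""
    totals = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(totals, 1)         # guard against empty classes

# Toy labels (illustrative only): 0 = positive, 1 = neutral, 2 = negative.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
rates = row_normalize(cm)
```

In the normalized matrix, a diagonal entry is the fraction of a class that was recognized correctly, and an off-diagonal entry in row i, column j is the fraction of class i that was misclassified as class j.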
The results of emotion classification are shown in Figure 8. We first analyzed the three-classification confusion matrices. Figure 8a shows that non-hearing-impaired subjects recognize neutral emotion best, and Figure 8b shows that hearing-impaired subjects recognize positive emotion best. Comparing Figure 8a,b, positive emotion recognition is similar in the two groups, with hearing-impaired subjects slightly better, while their recognition of neutral and negative emotions is poorer.
In the five-classification task, neutral emotion was still the best-recognized class for non-hearing-impaired subjects, and happy was the best-recognized class for hearing-impaired subjects. Comparing Figure 8c,d, hearing-impaired subjects recognize happy emotion better than non-hearing-impaired subjects, while their recognition of neutral, sad, angry, and fearful emotions is lower. In addition, for the hearing-impaired subjects, 22% of anger samples were misclassified as fear and 21% of fear samples were misclassified as anger. Therefore, it can be speculated that hearing-impaired subjects may have difficulty recognizing anger and fear.
Compared with the hearing-impaired subjects, the non-hearing-impaired subjects are better at identifying negative emotions (sadness, anger, and fear: 51%, 49%, and 49%, respectively). The difference between hearing-impaired and non-hearing-impaired subjects was largest for fear, at 4%. Therefore, it is speculated that, compared with non-hearing-impaired subjects, hearing-impaired subjects have difficulty in recognizing negative facial emotions, which is consistent with our previous video-based study.

Discussion
To verify the effectiveness of the proposed feature selection method, comparison experiments with different feature matrix combinations were carried out for both hearing-impaired and non-hearing-impaired subjects. The results appear in Table 5.
SSM(OEF) and QSM(OEF) denote the subtraction symmetric matrix and the quotient symmetric matrix based on the original EEG signal. SSM(DE) and QSM(DE) denote the symmetric difference feature and the symmetric quotient feature based on the differential entropy feature. For both hearing-impaired and non-hearing-impaired subjects, the accuracy of the three-classification and five-classification tasks using DE features is close to that of SSM(OEF) and QSM(OEF). Compared with SSM(OEF) and QSM(OEF), SSM(DE) and QSM(DE) perform better in all classification tasks. Concretely, in both three-classification and five-classification tasks, QSM(DE) performs best among the single features, SSM(DE) + QSM(DE) performs best when two feature matrices are fused, and SSM(OEF) + QSM(OEF) + SSM(DE) performs best when three feature matrices are fused. Moreover, the fusion of all four feature matrices obtains the best performance overall. The average classification accuracy of the three-classification and five-classification tasks reached 70.2% and 50.15% for hearing-impaired subjects, and 72.05% and 51.53% for non-hearing-impaired subjects. It can be inferred that features in different domains may be complementary in EEG emotion recognition.
The proposed method achieves outstanding performance (Table 6). In this work, window attention and grid attention are used for local interaction and global interaction, respectively, which may be useful for extracting emotional representation information.
To further explore the key regions affecting the emotions of hearing-impaired subjects, we drew brain topographic maps for the different frequency bands based on differential entropy features. As Figure 9 shows, comparing the brain topographic maps of different emotions reveals that the energy of different emotions differs in the frontal, temporal, parietal, and occipital lobes.
The negative emotion was identified by changes near the temporal lobe, while the happy emotion was identified by the parietal lobe and the right temporal lobe. In contrast, positive emotion was identified by the occipital lobe and temporal lobe. In the process of recognizing anger and fear, the key areas identified by hearing-impaired subjects overlap, which differs from non-hearing-impaired subjects. For anger and fear, there was no significant overlap between the activated electrodes of the hearing-impaired and non-hearing-impaired subjects across the five bands.
The hearing-impaired subjects activated a large number of brain regions in the δ and θ bands, while the non-hearing-impaired subjects only had a small number of activated electrodes in the occipital and parietal lobes. In the latter three frequency bands, the overlap rate of the activated electrode regions between the hearing-impaired and non-hearing-impaired subjects was also very low, which partly explains why the hearing-impaired subjects were less effective in recognizing anger and fear than the non-hearing-impaired subjects. This is related to the bias of hearing-impaired subjects in recognizing negative emotions. The distribution of emotion-discriminating brain areas in the temporal lobe of hearing-impaired subjects was the same as that of non-hearing-impaired subjects.
The difference, however, is that the discriminative brain regions of the hearing-impaired subjects are also located in the parietal lobe. According to studies of brain function, the parietal lobe is involved in the integration of vision and visual processing. This may indicate that visual information plays a more important role in the emotional discrimination of hearing-impaired subjects due to the loss of hearing ability. In the absence of auditory channels, hearing-impaired subjects may use the coordination of multiple brain regions to complete the acquisition and expression of emotional information.

Conclusions
In this paper, an EEG emotion recognition scheme based on emotional face stimuli was proposed. We collected EEG signals from hearing-impaired and non-hearing-impaired subjects while they were viewing different face pictures. To obtain spatial domain features, the SSM and QSM were constructed from the original signals and from the DE features, respectively, yielding four feature matrices. A classification network model based on the multi-axis self-attention mechanism was used for emotion recognition. The multi-axis self-attention module reduces the quadratic complexity of ordinary attention to linear without loss of non-locality. Spatial interaction between local and global scales is realized in each block, effectively combining local and global characteristics. Promising results were obtained for both hearing-impaired and non-hearing-impaired subjects in the three-classification and five-classification tasks with the proposed four-feature-matrix fusion strategy. This indicates that the complementarity between features can be used to improve emotion classification.
Based on the classification results for the three and five categories of emotion, the differences between non-hearing-impaired and hearing-impaired subjects in the ability to recognize different emotions, and in the electrodes of the active brain regions, were analyzed in detail through the confusion matrix and the brain topographic map. We found that, in the process of recognizing anger and fear, the key areas identified by hearing-impaired subjects overlap and are clearly different from those of non-hearing-impaired subjects. This is related to the bias of hearing-impaired subjects in recognizing negative emotions. We also found that, in the absence of auditory channels, hearing-impaired subjects may use the coordination of multiple brain regions to complete the acquisition and expression of emotional information, which agrees with the conclusion reached for hearing-impaired people under video stimulation.
In future work, we will extend our dataset and focus on developing general feature extraction methods and classification models for hearing-impaired and non-hearing-impaired people.

Informed Consent Statement:
Written informed consent has been obtained from the patient(s) to publish this paper. This section is uploaded as supplementary material.