The Effect of Time Window Length on EEG-Based Emotion Recognition

Various time window lengths have been used for feature extraction in electroencephalogram (EEG) signal processing in previous studies. However, the effect of time window length on feature extraction for downstream tasks such as emotion recognition has not been well examined. To this end, we investigate the effect of different time window (TW) lengths on human emotion recognition to find the optimal TW length for extracting EEG emotion signals. Both power spectral density (PSD) features and differential entropy (DE) features are used to evaluate the effectiveness of different TW lengths based on the SJTU emotion EEG dataset (SEED). Different lengths of TW are then processed with an EEG feature-processing approach, namely experiment-level batch normalization (ELBN). The processed features are used to perform emotion recognition tasks in six classifiers, and the results are compared with those obtained without ELBN. The recognition accuracies indicate that a 2-s TW length yields the best emotion recognition performance and is the most suitable for EEG feature extraction for emotion recognition. Deploying ELBN with the 2-s TW can further improve emotion recognition performance by 21.63% and 5.04% when using an SVM based on PSD and DE features, respectively. These results provide a solid reference for the selection of TW length in analyzing EEG signals for applications in intelligent systems.


Introduction
Human emotion recognition is a critical research topic in brain-computer interaction (BCI) [1][2][3][4]. Most current human emotion recognition approaches are based on facial expression images from cameras [5]. However, real emotions may be hidden behind facial expressions, consciously or unconsciously, which would make camera-based approaches invalid. Moreover, the effectiveness of these approaches is also limited in poor environments with low illumination or rapidly changing light distributions on human faces (e.g., in nighttime driving). Therefore, using physiological signals to directly recognize human emotions, free from environmental effects or faked facial expressions, is essential.
Among the various physiological signals, electroencephalogram (EEG) has been frequently reported to be closely and directly related to human emotions in previous studies [6][7][8][9][10]. George et al. [11] applied fast Fourier transformation (FFT) and frequency bandpass filtering to extract features from EEG signals and performed emotion recognition in the valence and arousal dimensions with a support vector machine (SVM). Asghar et al. [12] proposed a bag of deep features (BoDF) model to reduce the EEG feature dimensionality and adopted an SVM and k-nearest neighbors (KNN) to perform EEG-based emotion recognition. Pan et al. [13] applied the logistic regression (LR) algorithm with a Gaussian kernel and Laplacian prior for EEG-based emotion recognition, comparing it with Gaussian Naive Bayes (GNB) and SVM classifiers.
However, most of these studies use complete EEG samples with different time durations as inputs for model training. This treatment makes the trained model unsuitable for online or real-time emotion recognition, because the learned knowledge concerns only the characteristics of complete samples for post hoc detection [14]. Moreover, mixing samples with different durations together also prevents the trained model from capturing the changing characteristics of EEG signals in temporal sequences [15,16].
To this end, various time windows (TWs) have been developed and used in signal analysis of EEG temporal sequences. Abtahi et al. [17] used a 20-millisecond TW with a 10-millisecond offset to cut EEG signals into a time sequence to determine the suitable analysis model for EEG signals. Lin et al. [18] applied a 1-s TW without overlapping on channels of the EEG data to compute the EEG spectrogram, which was then used to investigate the relationship between emotional states and brain activities with machine-learning algorithms. Zheng et al. [19] used a non-overlapped 4-s TW with short-time Fourier transform to extract EEG features and performed emotion recognition with a feature-level fusion strategy and a decision-level fusion strategy, respectively. Zhuang et al. [20] took EEG data in every 5-s TW as material for empirical mode decomposition (EMD), which was beneficial for EEG-based emotion recognition performance. It can be observed that various TW lengths have been used in the processing of EEG signals. However, there is currently no criterion or prior knowledge on the temporal scale (i.e., TW length) at which to measure EEG data for emotion recognition.
Moreover, in the development of emotion recognition with EEG data, a significant problem is the negative influence of individual differences [15,16] which leads to diversified EEG response patterns to affect the generalization capabilities of classifiers across subjects. Gianotti et al. [21] reported the neural signatures underlying individual participants when they were being looked at. Matthews et al. [22] found that differences existed in EEG signal responses through an experiment on 150 patients who were asked to perform two signal detection tasks in a complex and simulated operational environment. Meanwhile, many other studies have proposed various data processing methods for subject-independent emotion recognition [15,23,24], but they process EEG data without considering differences among experimental groups.
Therefore, this paper aims to examine the effectiveness of different TW lengths on emotion recognition based on batch normalized EEG signals for better individualized emotion recognition performance. The main contributions of this paper are two-fold. Firstly, the best TW length to extract features for EEG-based emotion recognition is determined. This would fill the research gap on the selection of TW for feature extraction to facilitate EEG-based emotion recognition. Secondly, an experiment-level batch normalization (ELBN) method is newly applied in EEG feature processing to alleviate the negative impact caused by individual differences in different experiments. This method can help extract useful features without being greatly affected by human behavioral differences across experiments, which would greatly improve emotion recognition performances in online or real-time applications.
Another significant characteristic of EEG signals is that EEG is highly subject to various individual differences, which brings difficulties when a pre-trained model is applied to a new subject directly. To address this problem, previous studies have proposed a variety of approaches for EEG data processing. Li et al. [23] proposed a normalization method where the EEG signals in each electrode channel of each person were normalized into the range of [0, 1]. Another study [15] presented a domain adaptation method where task-invariant features and task-specific features, integrated in unified framework models, were adopted to eliminate individual differences. Lu et al. [24] also developed a dynamic entropy-based pattern for subject-independent emotion recognition. However, most of these previous studies treat EEG data from different subjects and different experimental batches as an integral whole without considering the influence of both inter-subject and inter-experiment differences, meaning subject-independent emotion recognition still faces challenges. To further diminish the effect of individual differences, inter-subject and inter-experiment differences should be taken into consideration when processing EEG data.

EEG Features
To analyze EEG signals from different perspectives, various features have been extracted to describe brain activity information. Unde et al. [31] used power spectral density (PSD), defined as the distribution of signal power over frequency, to show the strength of energy in the frequency domain. Shi et al. [32] applied differential entropy (DE), obtained by calculating the entropy of a continuous EEG sequence, to measure the complexity of EEG signals. Frantzidis et al. [33] used the amplitude and latency of event-related potentials (ERPs) as features in their research. However, detecting emotion-related ERPs is difficult in online applications as the onset is usually unknown. Kroupi et al. [34] employed the non-stationary index (NSI), defined as the standard deviation of all the means from the EEG signal pieces, to measure the inconsistency of EEG signals. Petrantonakis et al. [35] introduced higher order crossings (HOC)-based features to capture the oscillatory pattern of EEG signals [36].
Among these various EEG features, PSD and DE are two commonly used and well-accepted features for analyzing human EEG activity. PSD features represent the distribution and energy strength of signal power over frequency [8,9]. DE is an efficient numerical feature employed to measure signal complexity in EEG analysis [37], and it performs well in differentiating EEG signals between low and high frequency energies [7]. It has been demonstrated that PSD and DE features can effectively describe EEG signals to achieve high accuracies for emotion recognition [37]. Therefore, we used PSD and DE features for emotion recognition in this study.
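For a band-filtered EEG segment that is approximately Gaussian, DE has the simple closed form 0.5 ln(2πeσ²), which is the formulation commonly used in EEG emotion-recognition work. The sketch below is our illustration, not the paper's code:

```python
import numpy as np

def differential_entropy(signal):
    """Differential entropy of one band-filtered EEG segment.

    Assumes the segment is approximately Gaussian, in which case DE reduces
    to the closed form 0.5 * ln(2 * pi * e * sigma^2).
    """
    variance = np.var(signal)
    return 0.5 * np.log(2 * np.pi * np.e * variance)

# A zero-mean, unit-variance segment gives DE = 0.5 * ln(2*pi*e) ~= 1.42
rng = np.random.default_rng(0)
segment = rng.standard_normal(200)  # one 1-s epoch sampled at 200 Hz
de = differential_entropy(segment)
```

Because the logarithm compresses the dynamic range of band power, this formulation is one reason DE separates low- and high-frequency energies well.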

Dataset and Experiments
The SJTU emotion EEG dataset (SEED) [29] is a popular publicly available EEG dataset for various emotional analysis purposes. The data collection was performed by the BCMI laboratory at Shanghai Jiao Tong University in 2015. The ESI NeuroScan system was used to record the EEG data with a sample rate of 1000 Hz. A total of 62 electrode channels were used for EEG collection according to the international 10-20 system.
Researchers usually elicit specific emotions of subjects by audio or video clips and extract the corresponding EEG data for analysis [17,29,38]. Similarly, film clips were used for emotion induction in SEED. To elicit the different target emotions (i.e., positive, neutral, and negative), 15 Chinese film clips were selected following these criteria: (a) the length of the whole experiment should be limited to a reasonable range (e.g., 2-5 min) to avoid fatigue, (b) the content of the films should be easily understood without extra explanation, and (c) only one single target emotion should be elicited through the film content. The film clips were edited so that the selected video content for emotion elicitation could be effective during approximately 4 min of watching. The selected film clips for emotion induction in our experiments are listed in Table 1. There are five film clips for each emotion type. More details of the dataset and experiments can be found in [39]. The 15 film clips in Table 1 were separately presented to subjects in 15 trials for one experiment (see Figure 1). In each trial, a starting hint was given 5 s before the start of each clip. When the film clip had finished, 45 s were given for each subject to complete a questionnaire reporting their immediate emotional reactions to the film clip that they had just watched. Details of the questionnaire can be found in [6]. Subsequently, another 15 s were provided for rest before the start of the next trial. The order of emotions presented in the experiments was 1, 0, −1, −1, 0, 1, −1, 0, 1, 1, 0, −1, 0, 1, −1 (1 for positive, 0 for neutral, −1 for negative). According to this order, the distribution of the three emotion categories is balanced in each experiment (i.e., each emotion category has 5 corresponding trials in each experiment, and these trials are distributed following the order above).
The collected EEG data were downsampled to 200 Hz and filtered with a 0-75 Hz frequency band to filter the noise and remove the artifacts.
There were 15 young subjects (7 males and 8 females; age: 23.27 ± 2.37 years) participating in the experiments in the SEED dataset. Each subject repeated the experiment three times with an interval of one week or longer. Therefore, the SEED dataset contains 45 experiments across the 15 subjects. Since each experiment had 15 trials, there were 675 trials in total. For each emotion category (i.e., positive, neutral, or negative), there were 225 trials, which means that the emotions were balanced in the SEED dataset.
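The trial counts above follow directly from the experimental design and can be checked with a few lines of arithmetic:

```python
# Counts stated in the SEED dataset description
subjects, sessions, trials_per_experiment, emotions = 15, 3, 15, 3

experiments = subjects * sessions                    # 45 experiments
total_trials = experiments * trials_per_experiment   # 675 trials overall
trials_per_emotion = total_trials // emotions        # 225 trials per emotion
```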


Extracting Features Based on TW
The minimum duration of the collected trials was about 180 s; hence, each channel of the EEG signals can be segmented into 180 1-s epochs without overlapping. Therefore, there would be 58 × 180 epochs from the collected EEG signals in each trial. PSD features and DE features were calculated in each epoch in the five given frequency bands, respectively. Therefore, the format of the features in each trial was defined as (58, 5, 180), where 58 represents the 58 channels, 5 represents the 5 frequency bands, and 180 represents the total number of features extracted from the epochs in the corresponding trial. In total, there were 675 trials in all the 45 experiments. A total of 11 different TW lengths were examined to investigate the optimal TW length for EEG data extraction in this study (see Table 2). Both PSD features and DE features were separately calculated and averaged across the epochs in each TW. To compare the effectiveness of the examined features from different TW lengths, six classical classifiers were used for emotion recognition: KNN, LR, SVM, GNB, Multilayer Perceptron (MLP), and Bootstrap Aggregating (Bagging). These classifiers are among the most frequently used, with high accuracies and strong adaptability to different classification tasks [15,42,43]. The Python machine learning library sklearn was used to construct the models, and the relevant parameter settings are listed in Table 3.
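As a rough sketch of this pipeline for a single channel, the snippet below computes per-epoch PSD features with Welch's method and averages them over non-overlapping TWs. The band boundaries and the PSD estimator are our assumptions, not details specified by the paper (DE features would be averaged the same way):

```python
import numpy as np
from scipy.signal import welch

FS = 200  # SEED signals are downsampled to 200 Hz
# Five-band split; exact boundaries are an assumption
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}

def epoch_psd(epoch):
    """Mean PSD of one 1-s single-channel epoch in each of the five bands."""
    freqs, pxx = welch(epoch, fs=FS, nperseg=len(epoch))
    return np.array([pxx[(freqs >= lo) & (freqs < hi)].mean()
                     for lo, hi in BANDS.values()])

def tw_features(channel_signal, tw_len):
    """Per-epoch PSD features averaged over non-overlapping TWs of tw_len s."""
    n_epochs = len(channel_signal) // FS
    per_epoch = np.array([epoch_psd(channel_signal[i * FS:(i + 1) * FS])
                          for i in range(n_epochs)])     # (n_epochs, 5)
    n_tw = n_epochs // tw_len
    return per_epoch[:n_tw * tw_len].reshape(n_tw, tw_len, 5).mean(axis=1)

# A 180-s trial with a 2-s TW yields 90 feature vectors of 5 bands each
signal = np.random.default_rng(1).standard_normal(180 * FS)
features = tw_features(signal, tw_len=2)
```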

Experimental-Level Batch Normalization (ELBN)
To reduce the impact of individual differences on EEG-based emotion recognition [23], an experiment-level batch normalization (ELBN) method was applied to the obtained PSD features and DE features, respectively. ELBN is defined as

F_BNi = (F_i − F_min) / (F_max − F_min), (1)

where F_i and F_BNi represent the original value of a specific feature and the value of the feature with ELBN in an experiment, respectively, while F_min and F_max represent the minimum and maximum values of the corresponding feature in the same experiment, respectively. Features extracted from one frequency band in one channel are calculated in each trial, and the features from all 15 trials in the same experiment are normalized following Formula (1). The normalization occurs across the 15 trials in one experiment: each trial provides one feature from one frequency band in one channel when using a selected TW, and the 15 trials then contribute the 15 values for normalization. As for the 180-s TW, where there is just one TW, the normalization is performed once in one experiment with the minimum and maximum values across the 15 trials. Our proposed ELBN is conducted within an experiment to avoid interference or noise from other factors (e.g., change of body status in different experiments); it is newly developed to solve the individual difference and baseline deviation problems in data collected from different subjects on different days. The protocol of ELBN is shown in Figure 3. Although this idea seems simple, to the best of our knowledge, it has never been used in previous EEG-based emotion recognition studies.
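Formula (1) amounts to min-max normalization applied per feature across the 15 trials of one experiment. A minimal sketch follows; the array layout is our assumption:

```python
import numpy as np

def elbn(features):
    """Experiment-level batch normalization (Formula (1)).

    `features` stacks the 15 trials of ONE experiment along axis 0, e.g.
    shape (15, 58, 5) for trials x channels x bands; min and max are taken
    per feature across those trials only, rescaling each feature to [0, 1]
    within that experiment.
    """
    f_min = features.min(axis=0, keepdims=True)
    f_max = features.max(axis=0, keepdims=True)
    return (features - f_min) / (f_max - f_min + 1e-12)  # eps guards flat features

experiment = np.random.default_rng(2).normal(size=(15, 58, 5))
normalized = elbn(experiment)
```

Normalizing within one experiment, rather than over the whole dataset, is what keeps one subject's baseline shift on one day from contaminating the feature scale of another session.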


The Effect of TW Length on Emotion Recognition without ELBN
Processed PSD features and DE features with different TW lengths were separately fed into the six classifiers. The training data contain features extracted from 12 trials in each experiment, while the testing data contain features extracted from the other 3 trials of the same experiment. Both the training data and testing data are evenly distributed across the three emotions. A total of 10 random sets (see Table 4) of possible combinations of training data and testing data were selected, and their averaged accuracies in each classifier were used as the final recognition results. Although some trials recur across the 10 randomized sets, the distribution of trials for training (or testing) has been randomized as evenly as possible. K-fold cross-validation is not applied because the order of emotions presented in one experiment across the 15 trials is disordered (i.e., 1, 0, −1, −1, 0, 1, −1, 0, 1, 1, 0, −1, 0, 1, −1, with 1 for positive, 0 for neutral, and −1 for negative). The traditional k-fold procedure is not suitable for this case, since dividing the 15 trials into a training set and a test set must ensure that the emotions are evenly distributed. If any of the positive, neutral, or negative emotions were absent from the test set, the trained model would not be reliable because of the unbalanced training data, or the testing results for a specific emotion type could not be computed, making the averaged accuracies unreasonable. The recognition results of PSD features and DE features when using different TW lengths are shown in Tables 5 and 6, respectively. The accuracies were calculated using separate EEG frequency bands. The results show that the highest accuracy is achieved when using the LR classifier. The best accuracies when using PSD features and DE features are 67.85% and 78.67%, respectively. The maximum differences across different TW lengths are 3.56% when using PSD features and 2.59% when using DE features.
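One way to build such balanced random sets, holding one test trial per emotion (3 test, 12 training), is sketched below; the actual sampling scheme used to construct the paper's 10 random sets may differ:

```python
import random

# Emotion label of each of the 15 trials in one SEED experiment
TRIAL_LABELS = [1, 0, -1, -1, 0, 1, -1, 0, 1, 1, 0, -1, 0, 1, -1]

def balanced_split(seed):
    """One random 12-trial training / 3-trial testing split, with the test
    set holding exactly one trial per emotion (so 4 per emotion remain for
    training). The sampling scheme here is our assumption."""
    rng = random.Random(seed)
    test = [rng.choice([i for i, lab in enumerate(TRIAL_LABELS) if lab == emo])
            for emo in (1, 0, -1)]
    train = [i for i in range(15) if i not in test]
    return train, test

splits = [balanced_split(s) for s in range(10)]  # 10 random sets
```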
Furthermore, though the number of features increases with decreasing TW length, the recognition accuracies barely grow, indicating that the number of features has little influence on the recognition results. These results show that the emotion recognition accuracies are limited when using the classical classifiers based on either PSD or DE features, and the accuracy differences when using different TW lengths are not large.

The Effect of TW Length on Emotion Recognition with ELBN
PSD features and DE features with ELBN were separately fed into the six classifiers. Features from the 10 given random sets were used as inputs, and their averaged accuracies in each classifier were used as the final recognition results. The recognition results of PSD features and DE features are shown in Tables 7 and 8, respectively. The results indicate that ELBN performs well in improving emotion recognition accuracies based on EEG features. Compared with the emotion recognition performance when using the same features without ELBN in Tables 5 and 6, the emotion recognition accuracies when using features with ELBN are greatly improved. The best accuracy of PSD features is up to 79.48%, which is 11.53% higher than that without ELBN, and the best accuracy of DE features is up to 82.96%, which is 4.29% higher than that without ELBN. The increased accuracy by ELBN for PSD features is more than 10% for all the examined classifiers, and the greatest improvement (i.e., 21.63%) is achieved when using an SVM. When using DE features, the greatest contribution of our proposed ELBN is 20.67%. These significant accuracy improvements show that our proposed ELBN method can effectively retain more temporal information in the extracted features.

When using different TW lengths for emotion recognition, the results in Tables 7 and 8 show that the differences between TW lengths are more obvious than those in Tables 5 and 6. The emotion recognition accuracy when using the 2-s or 3-s TW is 6.15% higher than that when using the 180-s TW for the LR classifier based on PSD features. The highest accuracy when using the 2-s TW based on DE features is 7.18% higher than the accuracy when using the 180-s TW for the LR classifier. By comparing the results from different TW lengths, it can be observed that the best emotion recognition accuracy is achieved when using the 2-s TW together with the LR classifier; hence, this TW length together with the LR classifier is used for the following online recognition.
Our results show that the 2-s TW with ELBN has the best emotion recognition performance. In general, a longer TW leads to a smaller amount of input, which is beneficial for reducing computational cost [44], while a shorter TW expands the input scale of features along the temporal dimension, which is capable of capturing EEG transient changes [45]. However, employing a longer TW undermines the reading of temporal EEG dynamics, while using a shorter TW extends the computing time, which is inconvenient for online affective computing. Given that different TW lengths have different advantages and disadvantages, a suitable TW length that balances this trade-off is required, and our results show that the 2-s TW is an optimal choice with the highest recognition accuracy. As shown in Table 2, compared with the original 1-s TW length, the scale of input features is halved when using the 2-s TW, contributing to a decreased computational cost. Meanwhile, as shown in Tables 5 and 6, the 2-s TW is able to keep emotion recognition accuracy at a relatively high level, indicating that temporal characteristics of EEG signals can be effectively extracted for emotion recognition.
ELBN performs well in improving EEG-based emotion recognition according to the results in Tables 7 and 8. To explore its mechanism for accuracy improvement, samples were randomly selected from the SEED dataset to demonstrate the changes in EEG features after ELBN. Given that differences across EEG-based emotions contribute to emotion recognition [46], significance analysis of PSD features and DE features among the three emotions was conducted on the five frequency bands to explore the sensitivities of the features for emotion discrimination before and after applying ELBN. The nonparametric Kruskal-Wallis test was applied to the mean values of the features in each trial, and the significance analysis results are shown in Table 9. The results show that more features exhibit statistically significant differences among the examined emotions after applying ELBN, indicating that feature sensitivities to human emotions increase after applying ELBN. This is probably the reason for the recognition accuracy improvement after applying ELBN.
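A Kruskal-Wallis test of this kind can be run with scipy. The snippet below uses synthetic trial-mean values as stand-ins for the real per-trial SEED feature means, so the specific numbers are illustrative only:

```python
import numpy as np
from scipy.stats import kruskal

# Synthetic stand-ins for trial-mean feature values per emotion group;
# real inputs would be the per-trial mean PSD or DE values from SEED
rng = np.random.default_rng(42)
positive = rng.normal(0.6, 0.1, 15)
neutral = rng.normal(0.5, 0.1, 15)
negative = rng.normal(0.3, 0.1, 15)

h_stat, p_value = kruskal(positive, neutral, negative)
significant = p_value < 0.05  # does the feature discriminate the emotions?
```

The Kruskal-Wallis test is a sensible choice here because it is rank-based and does not assume the feature values are normally distributed within each emotion group.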

Online Emotion Recognition
A 2-s sliding TW with a 1-s step was used for online emotion recognition with the LR classifier. The 10 random sets were used with 540 × 10 samples for training and 135 × 10 samples for testing, where 540 and 135 are the numbers of samples for training and testing, respectively, in one random set, and the number 10 means that there are 10 random sets. In total, 675 × 10 × 179 TWs (i.e., total samples in one random set × total number of sets × TWs used in one sample) were used to examine the online emotion recognition performance, and the mean accuracy of the testing samples was used as the final output. The results when using PSD features or DE features are illustrated in Figures 4 and 5, respectively. The obvious distance between the red area and the blue/green areas in Figure 4a indicates that the positive emotion samples can be correctly recognized without being mistakenly recognized as neutral or negative emotion samples. Larger gaps can be found in Figure 4b,c, indicating that the neutral and negative emotion samples can be even more easily recognized without confusing them with the other emotion samples. Similar trends can be found in the results on DE features in Figure 5. These results show that the 2-s TW can be successfully used for online emotion recognition based on EEG signals with ELBN.
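The sliding-window inputs for online recognition (2-s TW, 1-s step, so 180 per-second feature vectors yield 179 windows per sample) can be generated as follows; the flattened feature layout is our assumption:

```python
import numpy as np

def sliding_tw(epoch_feats, tw_len=2, step=1):
    """Average per-second feature vectors over a sliding TW (2-s window,
    1-s step), yielding one input vector per step for online prediction."""
    n = epoch_feats.shape[0]
    return np.stack([epoch_feats[i:i + tw_len].mean(axis=0)
                     for i in range(0, n - tw_len + 1, step)])

# 180 one-second feature vectors -> 179 overlapping 2-s windows per sample
feats = np.random.default_rng(3).normal(size=(180, 58 * 5))
online_inputs = sliding_tw(feats)
```

Each of the 179 window vectors would then be fed to the trained LR classifier in turn, producing one emotion prediction per second after the first two seconds of signal.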

Influence of TW Length on Emotion Recognition
The selection of TW length mainly affects emotion recognition accuracy as well as model complexity caused by the number of input features. As shown in Table 2, a longer TW corresponds to fewer input features, which means that the model complexity would be lower. However, fewer input features when using a longer TW may lead to a lower recognition accuracy. The SVM classifier results in Tables 5 and 6 show that the highest recognition accuracy is achieved when the TW length is 30 s and 5 s for PSD and DE features without ELBN, respectively. However, the recognition accuracy reaches a bottleneck and would not continuously increase with the shortening TW length. The results with ELBN in Tables 7 and 8 show similar trends.
Given that a shorter TW can capture more features at a higher model complexity, but the recognition accuracy does not continuously increase as the TW length shortens, a balanced TW length selection strategy should be considered, which would be another interesting topic for further investigation. Although the LR results in Tables 7 and 8 show that the highest accuracy is achieved when using the 2-s TW, using a slightly longer TW (e.g., 3 s) would also be a good choice due to the generally stable accuracy performance with a lower model complexity. Which TW is the best choice depends on the selection criteria. In this paper, we do not focus on the balance between recognition accuracy and model complexity but simply use recognition accuracy as the selection criterion. In future work, how to reasonably select a TW by comprehensively considering recognition accuracy and model complexity needs deeper analysis.
Another problem is that different frequency bands, with their different characteristics, have different sensitivities to emotions [48], so different bands may have different optimal TW lengths. In our experiments, to evaluate the overall performance of each TW, the features extracted from all frequency bands were combined for training and testing, which makes it impossible to investigate the optimal TW length for each band separately. We will investigate this interesting topic in our future work.
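To illustrate how band-specific features could be separated, here is a minimal FFT-based band-power sketch for one window. The band boundaries follow a common convention but vary across studies, and the function name is illustrative.

```python
import numpy as np

# Conventional EEG bands (Hz); exact boundaries differ between studies.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}

def band_powers(window, fs):
    """Per-band power of one window via the FFT periodogram."""
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    power = np.abs(np.fft.rfft(window)) ** 2 / len(window)
    return {name: power[(freqs >= lo) & (freqs < hi)].sum()
            for name, (lo, hi) in BANDS.items()}

fs = 200
t = np.arange(2 * fs) / fs              # one 2-s window
window = np.sin(2 * np.pi * 10 * t)     # pure 10 Hz (alpha-band) tone
bp = band_powers(window, fs)
# The alpha band should dominate for a 10 Hz oscillation.
print(max(bp, key=bp.get))              # alpha
```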

Conclusions
The effectiveness of different time window (TW) lengths for emotion recognition is examined based on EEG signals, both before and after applying experiment-level batch normalization (ELBN). The results show that the highest recognition accuracy is achieved when the 2-s TW is used for feature extraction; the highest accuracies with PSD features and DE features are 79.48% and 82.96%, respectively. The developed ELBN increases the sensitivity of the features to emotion discrimination, which contributes greatly to the improvement in recognition accuracy. A limitation of this study is that only classical classifiers are used for emotion recognition; advanced algorithms based on neural networks and deep learning have been extensively developed for emotion recognition in recent years [48][49][50][51]. Our future work will focus on deploying the explored sensitive features after ELBN in the 2-s sliding TW for emotion recognition using advanced algorithms, and on analyzing the selection of TW length for each frequency band.

Data Availability Statement: Data available in a publicly accessible repository. The data presented in this study are publicly available in the SEED dataset at 10.1109/TAMD.2015.2431497, reference number [29].