Semi-Supervised Cross-Subject Emotion Recognition Based on Stacked Denoising Autoencoder Architecture Using a Fusion of Multi-Modal Physiological Signals

In recent decades, emotion recognition has received considerable attention. As enthusiasm has shifted toward physiological patterns, a wide range of elaborate features for physiological emotion data has emerged and been combined with various classification models to detect emotional states. To circumvent the labor of manually designing features, we propose to acquire effective and robust representations automatically through the Stacked Denoising Autoencoder (SDA) architecture with unsupervised pre-training, followed by supervised fine-tuning. In this paper, we compare the performances of different features and models through three binary classification tasks based on the Valence-Arousal-Dominance (VAD) affection model. Decision fusion and feature fusion of electroencephalogram (EEG) and peripheral signals are performed on hand-engineered features; data-level fusion is performed on deep-learning methods. It turns out that the fused data outperform either single modality. To take advantage of deep-learning algorithms, we augment the original data and feed it directly into our training model. We use two deep architectures and another generative stacked semi-supervised architecture as references for comparison to test the method's practical effects. The results reveal that our scheme slightly outperforms the other three deep feature extractors and surpasses the state-of-the-art of hand-engineered features.


Introduction
Over the course of the twentieth century, objective affective analysis remained largely within psychology, mainly taking the form of psychologists determining treatment options based on patients' self-reported emotional states. Since the beginning of the 21st century, however, with the rise of artificial intelligence, emotional research has broken into the field of computer science and information technology under the new name of "Affective Computing" [1], attempting to provide "Emotional Intelligence", the almost universally acknowledged gap between artificial intelligence and real machine intelligence [1][2][3][4]. Meanwhile, researchers from ethology, behavioral science, anthropology, and other fields have explored different aspects of this topic with various research purposes. Hence, in recent decades, affective research has been naturally interdisciplinary, pulling together knowledge from neurobiology, psychology, evolutionary biology, computer science, and beyond. Note that in this piece, "emotion" and "affection" are utilized to represent the same notion, but both are strictly distinguished from terms like "mood", "feelings", "attitude", "personality", etc.
Human emotions can be expressed through both verbal and nonverbal cues. The majority of pilot studies in the community used verbal expressions, particularly those derived from audio, images, and videos. Since affection is a highly personal property, however, emotions may be inhibited and elaborately cloaked in subjective verbal reports that vary from person to person.

In this paper, we aim to construct a deep-learning algorithm with an SDA architecture, comprising unsupervised pre-training and supervised fine-tuning via backpropagation of error derivatives, that can process raw data with relatively higher accuracy. We are motivated by the need for an efficient deep architecture that automatically extracts more powerful and robust representations to guide supervised learning algorithms. A central obstacle in emotion detection is the limited size of available databases: more powerful representations are expected to be built from larger-scale unsupervised learning. Given limited labeled data and an abundance of unlabeled data, semi-supervised learning is the more practical choice. In detail, the principal contributions of this paper are as follows:

1. We propose a semi-supervised deep-learning architecture that extracts the underlying representations of raw physiological emotion data without any manual intervention. The experimental results indicate that the Stacked Denoising Autoencoder (SDA) ranks above the other three deep networks. To the best of our knowledge, ours is the first work to utilize the stacked denoising autoencoder as the feature extractor and, in a true sense, to directly take the raw data as input to a model.

2. By looking back at the historical roots and the development of the emotion-recognition field, we implement most of the recommended well-designed features and perform a comprehensive comparative analysis of diverse existing methodologies and features on the same benchmarking database, showing that features from the wavelet transform are the most efficient.

3. The fusion of the two physiological modalities is performed at the data level, feature level, and decision level. The results demonstrate that the two modalities operate simultaneously, rather than "EEG first, and periphery second" as much of the literature claims.

4. Cross-subject and single-subject accuracy show a significant gap, which echoes the conclusions of previous studies in psychobiology (emotions are individualized). This gap can be reduced as the amount of data grows. To our knowledge, ours is the first work to investigate the difference in performance between cross-subject and single-subject recognition.
The remainder of this paper is organized as follows. The derivation and selection of discriminative features are discussed in Section 2. Section 3 presents the premier well-performing recognizers in the literature. The experimental setup and result analysis are presented in Section 4. Finally, Section 5 concludes this paper and discusses future work.

Notation
Throughout this paper, the following notation is used: the EEG trace or the peripheral physiological signal within a certain epoch is expressed as a function of time ξ(t) ∈ R^T, where T is the number of time samples in ξ. The time derivative of ξ(t) is denoted ξ̇(t).

Material
As mentioned above, the benchmarking dataset DEAP [9] is selected to compare the performances of different representations and algorithms. The physiological data in DEAP are adapted differently for the various algorithms, including a naive way of augmenting the limited samples to exploit deep-learning models, and reducing the dimensionality by extracting artificially designed features and applying Principal Component Analysis.

The DEAP Dataset
Mathematically modeling emotion to represent affection's latent structure is the first step in emotion-related research. Among the more popular theoretical frameworks for measuring emotion are the discrete model [5,14] and the dimensional theory. A comprehensive analysis of dimensional and discrete conceptual views on emotions was completed by Eddie Harmon-Jones et al. in their review [9], in which both models were held to have much to offer in understanding the psychology of emotions. In the more recent literature, an obvious preference for the dimensional model exists in building physiological emotion databases, and it is gaining increasing acceptance. For example, the dataset MAHNOB-HCI [8] was recorded using the Valence-Arousal-Dominance (VAD) model; the dataset DEAP [9] was also recorded along the VAD scales. As a result, we also used the well-known VAD affection model to conduct the experiments.
DEAP is a multimodal dataset for the analysis of human affective states. It contains recordings of the EEG traces and 13 other peripheral physiological signals of 32 subjects (half male) watching 40 one-minute music clips. Specifically, each subject yields 40 clips/trials. For each trial, 40 channels captured the physiological voltage fluctuations from 45 electrode sites, resulting in 8064 time samples per channel: a 60 s video segment plus a 3 s baseline at a 128 Hz sampling frequency. Considering that inner affective states change slowly, and because we need dimensionality reduction, we only use the data in the last 30 s. The threshold for splitting each dimension of the affection model is simply set at 4.5, right in the middle of the rating range (0, 9), thereby obtaining the ground-truth labels for the three basic binary recognition tasks (arousal, valence, and dominance).
For the detailed placement of the 45 electrodes, refer to [9] or the dataset website. The 13 peripheral signals are combined into eight channels, five of which, the horizontal electrooculogram (EOG), vertical EOG, zygomaticus major electromyography (EMG), trapezius EMG, and GSR, are derived from the difference between two electrodes. Normalized values of the eight peripheral channels are illustrated in Figure 1. Prior probabilities of the 32 single-subject recognition tasks and the cross-subject recognition task (marked by the red dot) on all three scales of the dimensional affect model are illustrated in Figure 2.
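The labeling step described above can be sketched in a few lines, assuming the continuous ratings live in a NumPy array (the array here is synthetic, not the real DEAP annotations):

```python
import numpy as np

# Binarize continuous ratings at the midpoint threshold of 4.5, as done
# for the valence/arousal/dominance tasks above. The ratings array is a
# random stand-in for the actual DEAP self-assessment scores.
rng = np.random.default_rng(9)
ratings = rng.uniform(0, 9, size=(1280, 3))      # valence, arousal, dominance
labels = (ratings > 4.5).astype(int)             # 1 = "high", 0 = "low"
print(labels.shape, labels.min(), labels.max())
```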


Augmenting the Raw Data
Existing emotional databases share the same weakness: insufficient size. In this regard, datasets of visual and vocal modalities are no exception. For cross-subject emotion recognition, DEAP holds at most 1280 pieces of multimodal physiological data and corresponding subjective ratings on five dimensions (valence, arousal, dominance, liking, and familiarity), each ranging continuously from 0 to 9. The number 1280 is too few to fully exploit the power of deep-learning algorithms. In addition, even setting aside the insufficient data volume, directly concatenating all 40 channels would result in a vector with 153,600 components (3840 × 40), where the three-second baseline is removed and only the second half of the data is employed. As a result, 1280 input vectors of length 153,600 would force a shallow but wide network in order to avoid overfitting. Therefore, some deep-learning pioneers in physiological emotion detection have tried to obtain smaller and more acceptable compressed representations from the raw data.
However, deep learning is desirable and valuable precisely because it can take the original data as input rather than extracted or compressed representations, apart from basic preprocessing (such as patching missing values and normalizing to eliminate magnitude differences). Ingeniously, we propose to treat the signal captured from each channel as a separate piece of data, so that all channels of the same trial inherit the subjective rating scores of that trial. This simple augmentation yields 40,960 pieces of data for EEG signals, 10,240 for peripheral signals, and 51,200 for data-level fusion, with 3840 components per piece under the same conditions as above. The augmented raw data were normalized along the channel and then fed into our SDA network for feature extraction and classification.
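The channel-wise augmentation amounts to a reshape plus a label repeat. A minimal sketch, assuming the DEAP-style array shapes quoted above (1280 trials × 40 channels × 3840 samples; the data here are random placeholders):

```python
import numpy as np

# Stand-in for the trimmed DEAP recordings: 1280 trials, 40 channels,
# 3840 samples (last 30 s at 128 Hz, baseline removed).
rng = np.random.default_rng(0)
data = rng.standard_normal((1280, 40, 3840))
labels = rng.integers(0, 2, size=1280)          # one binary label per trial

# Treat every channel as an independent sample; each inherits its trial's label.
aug_data = data.reshape(-1, 3840)               # (51200, 3840)
aug_labels = np.repeat(labels, 40)              # (51200,)

# Normalize each piece (zero mean, unit variance along the time axis).
aug_data = (aug_data - aug_data.mean(axis=1, keepdims=True)) / \
           (aug_data.std(axis=1, keepdims=True) + 1e-8)

print(aug_data.shape, aug_labels.shape)
```

Selecting only the 32 EEG rows or the 8 peripheral rows of each trial before reshaping gives the 40,960- and 10,240-sample variants.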


Human-Designed Features
Pattern recognition algorithms are typically a combination of features and learning algorithms. Generally, the pre-processed raw data are hard to process directly due to their large volume and the inability of classifiers to extract discriminative information from them. Feature engineering is an essential way of exploiting prior knowledge and human ingenuity to compensate for the weaknesses of current learning models. Therefore, we must resort to features that characterize the most crucial information in the raw signals. In emotion recognition, frequently utilized human-designed features can be classified into three categories: statistics derived from the time domain, the frequency domain, and the time-frequency domain. As illustrated in Figure 3 and Table 2, a total of 1896 EEG features and 248 peripheral features were extracted.

Table 2. The extracted human-designed features by domain.

Domain | Features
Time | Power, mean, standard deviation, normalized 1st difference, normalized 2nd difference, mobility, complexity, non-stationary index, and higher-order crossings of each channel; fractal dimension for each EEG channel.
Frequency | Power-law index, averaged band power of four frequency bands, BE1, BE2, MMOB, and FOSM derived from the bispectrum of each EEG channel; magnitude squared coherence estimate, differential asymmetry, and rational asymmetry averaged on four frequency bands of 14 electrode pairs.
Time-frequency | Entropy and the absolute logarithmic recursive energy efficiency of detail coefficients at four decomposition levels, and root mean square of the energy of four frequency bands; averaged energy of four frequency bands of the Hilbert-Huang spectrum of each channel.

Time Domain
• Statistics of signals
The most common quantization terms are calculated entirely in the time domain: the power, reflecting the energy of the signal; the average over the whole time interval, representing the lowest-resolution approximation in the discrete wavelet transform; the variance, characterizing the degree of deviation; and the difference, the approximate derivative of the discrete-time series. The power, mean, standard deviation, normalized 1st difference, and normalized 2nd difference of the 40 channels of one trial are computed.
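These statistics can be sketched for one channel as follows; the exact normalization of the difference features is an assumption (mean absolute difference divided by the signal's standard deviation, a common convention the paper does not spell out):

```python
import numpy as np

def time_domain_stats(x):
    """Basic time-domain features for one channel x (1-D array)."""
    power = np.mean(x ** 2)                      # signal energy per sample
    mean = np.mean(x)
    std = np.std(x)
    d1 = np.mean(np.abs(np.diff(x))) / std       # normalized 1st difference
    d2 = np.mean(np.abs(x[2:] - x[:-2])) / std   # normalized 2nd difference
    return np.array([power, mean, std, d1, d2])

t = np.linspace(0, 30, 3840)                     # 30 s at roughly 128 Hz
x = np.sin(2 * np.pi * 10 * t)                   # toy 10 Hz oscillation
print(time_domain_stats(x))
```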
Complexity:
• Non-stationary index (NSI)
The NSI proposed in [16] measures the complexity of a time series and can capture its degree of non-stationarity. The temporal variations of the amplitude of EEG signals are involved, irregular, turbulent, and non-periodic, yet they exhibit long-range correlations over most time scales, indicating the presence of self-invariant and self-similar structures. The NSI is easy to implement: divide the time sequence to be analyzed into several segments, calculate the local average of each segment, and finally compute the variance of the local averages.
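A minimal sketch of the NSI procedure just described; the segment count and the initial z-score normalization are illustrative assumptions, not the exact settings of [16]:

```python
import numpy as np

def nsi(x, n_segments=32):
    """Non-stationary index: spread of the local segment means."""
    x = (x - x.mean()) / (x.std() + 1e-8)        # standardize the series
    segs = np.array_split(x, n_segments)         # divide into segments
    local_means = np.array([s.mean() for s in segs])
    return local_means.std()                     # variance-style spread

rng = np.random.default_rng(1)
stationary = rng.standard_normal(3840)
drifting = stationary + np.linspace(0, 5, 3840)  # slow trend -> non-stationary
print(nsi(stationary), nsi(drifting))
```

A drifting series scores markedly higher than a stationary one, which is the behavior the index is meant to capture.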

•
Fractal dimension (FD) Another indicator for describing the complexity or chaos of EEG signals is the FD, whose stable and precise values are generally derived through the Higuchi algorithm [17]. Fractal dimension-based features build on evidence that EEG signals can be regarded as a fractal curve, each part of which can be seen as a reduced-scale image of the whole. The calculation procedure is as follows.
First, consider a finite set of time-series observations sampled at a regular interval:
ξ(1), ξ(2), ξ(3), . . . , ξ(T).
Second, construct new time sequences ξ_k^m from the above-given signal:
ξ_k^m : ξ(m), ξ(m + k), ξ(m + 2k), . . . , ξ(m + [(T − m)/k]·k),
where [·] denotes the Gauss notation for the floor function; k denotes the time interval; m = 1, 2, · · · , k is the initial time point of each subsequence; both k and m are integers. Next, the length of each sub-curve ξ_k^m is defined as:
L_m(k) = (1/k) { ( Σ_{i=1}^{[(T−m)/k]} |ξ(m + ik) − ξ(m + (i − 1)k)| ) · (T − 1) / ([(T − m)/k]·k) }.
The term (T − 1)/([(T − m)/k]·k) is a normalization factor, and the sum represents the sum of the absolute values of the forward differences of the m-th subsequence. Then, the length of the curve for the time interval k, L(k), is defined as the mean over the k sets of L_m(k). Last, if L(k) ∝ k^(−D_ξ), then the EEG curve is fractal with dimension D_ξ. Note that the FD has not been utilized on the peripheral physiological channels.
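The Higuchi procedure above can be sketched directly (0-based indices instead of the 1-based m of the derivation; the choice k_max = 8 is an illustrative assumption):

```python
import numpy as np

def higuchi_fd(x, k_max=8):
    """Higuchi fractal dimension: slope of log L(k) versus log k."""
    T = len(x)
    ks = np.arange(1, k_max + 1)
    L = []
    for k in ks:
        Lm = []
        for m in range(k):                        # subsequence start points
            idx = np.arange(m, T, k)              # xi(m), xi(m+k), xi(m+2k), ...
            n = len(idx) - 1                      # number of forward differences
            if n < 1:
                continue
            # curve length with the (T-1)/(n*k) normalization factor
            length = np.abs(np.diff(x[idx])).sum() * (T - 1) / (n * k)
            Lm.append(length / k)
        L.append(np.mean(Lm))
    return -np.polyfit(np.log(ks), np.log(L), 1)[0]

rng = np.random.default_rng(2)
white = rng.standard_normal(4096)
line = np.linspace(0, 1, 4096)
print(higuchi_fd(white), higuchi_fd(line))   # noise near 2, smooth curve near 1
```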

•
Higher-order crossing (HOC) Higher-order statistics have been found to be more effective. The zero-crossings count of a finitely long, real-valued, zero-mean series captures its oscillating behavior around the zero level, and this count changes whenever a filter is applied to the series. Naturally, a specific sequence of filters applied to a time series yields a specific sequence of zero-crossing counts; these are the so-called HOC-based features, which measure the oscillation pattern in the sense that the stronger the oscillation, the higher the zero-crossing rate, and vice versa. Herein, enlightened by [18], we apply a sequence of high-pass filters based on the backward difference operator to the zero-mean time series ξ(t).
The filter sequence is defined through the backward difference operator ∇ξ(t) = ξ(t) − ξ(t − 1):
ζ_k{ξ(t)} = ∇^(k−1) ξ(t),
where ∇^(k−1) denotes the difference operator applied iteratively k − 1 times, and k = 1, 2, · · · , 10 is the order. The resulting k features, i.e., D_k, consist of the zero-crossings count of the filtered time series ζ_k{ξ(t)}. As shown in Figure 4, the values of the ten HOC features of the 32 EEG channels are presented in a heat map; the lighter the color, the larger the value. As the order increases, the zero-crossing count decreases. Higher-order crossings can also be applied to the peripheral signals.
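A compact sketch of the HOC scheme: count zero crossings, difference the series, and repeat up to the chosen order (the crossing count via sign changes is a standard convention, assumed here rather than taken from [18]):

```python
import numpy as np

def hoc_features(x, order=10):
    """Higher-order crossings via repeated backward differencing."""
    z = x - x.mean()                 # zero-mean series
    feats = []
    for k in range(1, order + 1):
        # zero crossings = sign changes in the filtered series
        feats.append(int(np.sum(np.diff(np.sign(z)) != 0)))
        z = np.diff(z)               # one more backward difference per order
    return np.array(feats)

rng = np.random.default_rng(3)
x = np.cumsum(rng.standard_normal(3840))   # smooth, random-walk-like signal
print(hoc_features(x))
```

For a smooth signal the low-order counts are small, and repeated differencing drives the counts upward.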

Frequency Domain
Signal processing has always relied on power spectral density (PSD) analysis as a conventional instrument for analyzing time series. In particular, when the power spectrum follows the power law P(f) ∝ f^(−η), where P(f) denotes the power spectral density and f is the frequency in hertz, the exponent η can also be considered an index characterizing the irregularity of signals. The PSD is formally defined as the Fourier Transform of the autocorrelation sequence (the Wiener-Khintchine theorem, which assumes the stationarity of stochastic signals):
P(f) = Σ_{m=−∞}^{+∞} R_ξξ(m) e^(−j2πfm),
where R_ξξ(m) := E[ξ*(n)ξ(n + m)] is the autocorrelation sequence. Since ξ in this article is definitely real-valued and its autocorrelation is summable (a sufficient but not necessary condition for the existence of the power spectrum), the power spectrum exists and is real-valued, non-negative, and symmetrical.
To estimate the PSDs of signals that do not satisfy the generalized stationarity assumption, we may use Welch's averaged, modified periodogram [19]. First, chop a length-N signal into L overlapping segments of equal length M; then window each segment and apply the Fourier Transform to determine its frequency components; finally, take the power spectrum of the entire signal as the average over all segments. Herein, an 853-point FFT with a 426-point (half-window) overlapped Hamming window function w was adopted to generate the PSDs. An example of the power spectral density of EEG signals is illustrated in Figure 5a. The power spectral density curves show a mild long-tail trend.
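The Welch estimate and the power-law index can be sketched with `scipy.signal.welch`, mirroring the window settings quoted above (the fitting range 1-47 Hz is an illustrative assumption, and the signal is a random stand-in for one channel):

```python
import numpy as np
from scipy import signal

fs = 128
rng = np.random.default_rng(4)
x = rng.standard_normal(3840)                       # stand-in for one channel

# Welch PSD with an 853-point Hamming window and half-window overlap.
f, pxx = signal.welch(x, fs=fs, window='hamming',
                      nperseg=853, noverlap=426)

# Power-law index: negative slope of the log-log spectrum, P(f) ~ f^(-eta).
mask = (f > 1) & (f < 47)                           # avoid DC and the band edge
eta = -np.polyfit(np.log(f[mask]), np.log(pxx[mask]), 1)[0]
print(f.shape, pxx.shape, round(float(eta), 3))
```

For white noise the fitted η stays near zero; EEG typically yields a clearly positive exponent.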

•
Band Power
Power in the four major frequency bands θ (3-7 Hz), α (8-13 Hz), β (14-29 Hz), and γ (30-47 Hz) is among the most popular features in EEG-based emotion recognition. The Short-Time Fourier Transform is more commonly used to calculate the approximate average power of each band because it is more robust to noise than the plain Fourier Transform; Figure 5b shows the spectrogram of an EEG channel over 30 s in the 3-47 Hz range. As one can see, the energy is mainly concentrated in the θ and α bands in this case. Here, we adopt a Hamming window of length 1000 ms with no overlap. The final computed features comprise the logarithms of the average, maximum, and minimum power of each band, the power variance across all frequency bands, and the ratio of the average β power over the average α power.
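The band powers can be sketched from an STFT-style spectrogram (128-sample windows approximate the 1000 ms Hamming window at 128 Hz; the band edges follow the ranges above):

```python
import numpy as np
from scipy import signal

fs = 128
bands = {'theta': (3, 7), 'alpha': (8, 13), 'beta': (14, 29), 'gamma': (30, 47)}

def band_powers(x):
    """Average power per band from a non-overlapping Hamming spectrogram."""
    f, t, sxx = signal.spectrogram(x, fs=fs, window='hamming',
                                   nperseg=128, noverlap=0)
    out = {}
    for name, (lo, hi) in bands.items():
        m = (f >= lo) & (f <= hi)
        out[name] = sxx[m].mean()       # mean power over band and time
    return out

t = np.arange(3840) / fs
x = np.sin(2 * np.pi * 10 * t)          # 10 Hz tone falls in the alpha band
bp = band_powers(x)
print(max(bp, key=bp.get))              # prints 'alpha'
```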
Spectrogram, the most basic method, relies on conventional Fourier spectral analysis, which assumes piecewise stationarity of the data. The typical implementation of the spectrogram is the Short-Time Fourier Transform (STFT), with which a time-frequency distribution can be obtained by successively sliding the window along the time axis and performing the Fast Fourier Transform on each time window. The uncertainty principle in the field of time and frequency analysis demonstrates that increased time resolution reduces frequency resolution and vice versa. STFT suffers from inaccurate time-frequency positioning due to this conflicting phenomenon. Therefore, it will only serve to compute the band power as described above and will not be further discussed.

•
Higher-order spectra (HOS) Higher-order spectral analysis is a non-linear signal processing framework, building on the fact that a stochastic non-Gaussian signal carries more information than its autocorrelation or power spectrum can convey. Higher-order spectra, defined in terms of a signal's higher-order moments or cumulants, contain this additional information. Higher-order spectral analysis assumes that the processes or signals of interest are zero-mean, just as higher-order crossings do in the time domain, because higher-order cumulants are invariant to mean shifts. Higher-order spectra include polyspectra such as the bispectrum, bicoherence, and trispectrum. Here, we analyze only the bispectrum for simplicity. For an in-depth discussion of the definitions of "polyspectrum" and "cumulant", their properties, and implementation issues, the reader is encouraged to consult the Higher-Order Spectral-Analysis Toolbox [20].
The bispectrum serves to extract phase information and to characterize the properties of non-linear mechanisms such as quadratic and cubic patterns [21]. Features of the bispectrum were derived for the quantification of emotions in [2]. The bispectrum is defined as the two-dimensional Fourier Transform of the third-order cumulant sequence:
B(f_1, f_2) = Σ_m Σ_n c_3(m, n) e^(−j2π(f_1·m + f_2·n)),
where c_3(m, n) denotes the third-order cumulant sequence. Note that the bispectrum is a function of two frequencies. Herein, the 128 × 128 bispectra are estimated by first segmenting the signal into 30 non-overlapping 128-sample slices (only the second half's recordings are used), next applying lag-domain windows to the computed cumulants in the non-redundant region, and finally computing a two-dimensional 128-point FFT of the windowed cumulants, with shifting and rotating for proper orientation. Due to the vast size of the computed bispectrum (128 × 128 = 16,384), we instead derive the four features proposed in [2] from the bispectrum data:

1. Normalized bispectral entropy:
BE1 = −Σ_Ω p_n log p_n, where p_n = |B(f_1, f_2)| / Σ_Ω |B(f_1, f_2)|,
and Ω is the non-redundant or principal region, as illustrated in Figure 6. For a specific definition of the non-redundant region, the reader is referred to [20][21][22].

2. Normalized bispectral squared entropy:
BE2 = −Σ_Ω q_n log q_n, where q_n = |B(f_1, f_2)|^2 / Σ_Ω |B(f_1, f_2)|^2.

3. Mean magnitude of the bispectrum:
MMOB = (1/L) Σ_Ω |B(f_1, f_2)|,
where L is the number of points within Ω.

4. First-order spectral moment:
FOSM = Σ_{k=1}^{N} k log|B(k, k)|, (20)
where N denotes the number of diagonal elements of the bispectrum matrix.
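Given a precomputed bispectrum matrix, the four descriptors reduce to a few array operations. A sketch that, for simplicity, sums over the whole matrix rather than restricting to the non-redundant region Ω (the small epsilon guards are illustrative):

```python
import numpy as np

def bispectral_features(B):
    """BE1, BE2, MMOB, and FOSM from a bispectrum matrix B."""
    mag = np.abs(B)
    p = mag / mag.sum()
    be1 = -np.sum(p * np.log(p + 1e-12))            # normalized bispectral entropy
    q = mag ** 2 / (mag ** 2).sum()
    be2 = -np.sum(q * np.log(q + 1e-12))            # squared-entropy variant
    mmob = mag.mean()                               # mean magnitude
    diag = np.diag(mag)
    k = np.arange(1, len(diag) + 1)
    fosm = np.sum(k * np.log(diag + 1e-12))         # first-order spectral moment
    return be1, be2, mmob, fosm

rng = np.random.default_rng(5)
B = rng.standard_normal((128, 128)) + 1j * rng.standard_normal((128, 128))
print([round(float(v), 3) for v in bispectral_features(B)])
```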
The quantities mentioned above are all computed from the recordings of a single electrode (except the average power ratio of β over α). Yet researchers have found that electrodes in different scalp regions react jointly while a person experiences different affective states [1].
• Magnitude squared coherence estimate (MSCE)
MSCE features describe the correspondence of two signals coming from different channels, ξ_i and ξ_j, at each frequency, taking values between 0 and 1 [15]. The MSCE is given in Equation (21):
C_ij(f) = |P_ij(f)|^2 / (P_i(f) P_j(f)), (21)
where C_ij(f), the MSCE of ξ_i and ξ_j, indicates how well ξ_i corresponds to ξ_j at each frequency; P_ij is the cross-power spectral density; and P_i and P_j denote the power spectral densities of ξ_i and ξ_j, respectively.
• Differential asymmetry and rational asymmetry According to neuroscientific findings concerning hemispheric asymmetry and its relationship to emotion [23], features based on combinations of symmetrical electrode pairs have garnered attention. They are typically classified as the differential asymmetry and the rational asymmetry between electrode pairs symmetrically distributed on the cerebral cortex in the international 10-20 system. The former usually refers to the average power difference of electrode pairs on the left/right hemispheres of the scalp, ∆x = x_l − x_r, and the latter to the ratio of the pairs, r_x = x_l / x_r.

Time-Frequency Domain
It is not new to decompose a signal into its harmonic components; as early as 1677, Newton carried out the spectral decomposition of light, and the Fourier Transform led to modern signal analysis based on transform-domain decomposition. Several methods have been devised to localize the global representation of Fourier analysis, each failing in one way or another. In time-frequency analysis, signals are typically decomposed into a small number of components based on the local characteristic time scale, so frequencies can be localized accurately. The primary concern is the instantaneous frequency and energy rather than the global frequency and energy defined by Fourier spectral analysis. Time-frequency methods can thus provide additional information by capturing the dynamic changes caused by the non-stationarity of signals and presenting them as an energy-frequency-time distribution.

•
Hilbert-Huang spectrum (HHS) Using the Hilbert-Huang spectrum, one can localize any event in the time domain and on the frequency axis. The Hilbert-Huang spectrum is derived by applying Empirical Mode Decomposition (EMD) followed by the Hilbert Transform (HT). With EMD, any complicated non-linear and non-stationary signal can be decomposed into a series of "intrinsic mode functions" that admit well-behaved Hilbert Transforms. Through the subsequent Hilbert Transform of the Intrinsic Mode Functions (IMFs), which are derived from the local properties of the signal, the instantaneous frequencies and energy are obtained as functions of time, resulting in a full energy-frequency-time distribution of the original data. The paper [24] summarized the necessary conditions for representing a non-linear and non-stationary time series as completeness, orthogonality, locality, and adaptivity: completeness guarantees the precision of the expansion; orthogonality avoids leakage; locality serves non-stationary signals, which have no fixed time scale; and adaptivity is crucial for avoiding harmonic distortion of non-linear phenomena.
An IMF component involves only one oscillatory mode, so the Hilbert Transform can provide a full description of its frequency content. Contrary to other expansion methods, the basis of the HHS is derived from the raw data through the trait of riding waves: one undulation rides on top of another, which in turn rides on still other fluctuations, and so on [24]. The sifting process that systematically extracts the IMFs is expanded in detail in Algorithm 1; the local maxima and local minima separately define the upper and lower envelopes.

Algorithm 1. Hilbert-Huang spectral analysis [24]. Input: the original signal vector x. Output: the Hilbert-Huang spectrum, i.e., an energy-frequency-time distribution of x.

As depicted in Figure 7, the first IMF component contains information over the whole frequency range, yet the other IMFs cover much narrower bands. For example, the third IMF shows a soft spike at about 21 Hz and an extremely sharp peak at around 44 Hz, and the ninth IMF shows high energy at 23 and 34 Hz.
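The Hilbert stage of this pipeline can be sketched with `scipy.signal.hilbert`; the EMD sifting itself is assumed already done, so a pure sinusoid stands in for a mono-component IMF:

```python
import numpy as np
from scipy.signal import hilbert

# Instantaneous amplitude and frequency of one (assumed) IMF via the
# Hilbert Transform -- the second stage of Hilbert-Huang analysis.
fs = 128
t = np.arange(3840) / fs
imf = np.sin(2 * np.pi * 10 * t)                 # toy mono-component "IMF"

analytic = hilbert(imf)                          # analytic signal x + j*H{x}
amplitude = np.abs(analytic)                     # instantaneous amplitude
phase = np.unwrap(np.angle(analytic))            # unwrapped instantaneous phase
inst_freq = np.diff(phase) / (2 * np.pi) * fs    # instantaneous frequency (Hz)

print(round(float(np.median(inst_freq)), 2))     # close to 10 Hz
```

Binning the amplitudes over (time, instantaneous frequency) for every IMF and stacking them yields the energy-frequency-time distribution.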

Discrete wavelet transform (DWT)
The multi-resolution wavelet approach, gloriously called "the mathematical microscope", has attracted widespread popularity. It is essentially a linear Fourier spectral analysis with adjustable windows. Unlike the STFT, wavelet analysis adapts to the fluctuations of signals by adjusting the width of the time window and the frequency scale. The general continuous definition of the WT is:
W(a, b) = (1/√a) ∫ x(t) ψ*((t − b)/a) dt, (22)
where a is the dilation factor; 1/√a gives the frequency scale; b is the translation of the origin and gives the temporal location of an event; ψ(t) is the mother wavelet function. The physical sense of Equation (22) can be interpreted as the energy of x(t) of scale a at time b.
Here, we use the orthogonal and compactly supported the fourth order Daubechies wavelet to decompose the signals into detail coefficients of four levels and approximation coefficients of the fourth level, designated as D1, D2, D3, D4, and A4, respectively. Further, given the sampling rate of preprocessed signals in DEAP is 128 Hz, the corresponding frequency range of each set of coefficients is deduced by the criteria shown in Table 3. The entropy, root mean square (RMS), and the absolute logarithmic recursive energy efficiency (abs(log(REE))) of frequency bands θ, α, β, γ are derived by extending the approach introduced in [22]. Table 3. Frequencies corresponding to different decomposition levels for the "db4" wavelet with a sampling frequency of 128 Hz.
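As a quick sketch of the Table 3 criteria, the band edges follow from dyadic halving of the spectrum at each decomposition level (a generic property of the DWT, not specific to db4):

```python
def dwt_band_edges(fs, levels):
    """Frequency range (Hz) covered by each set of DWT coefficients.

    Each decomposition level halves the remaining band, so detail Dk spans
    [fs / 2**(k+1), fs / 2**k] and the final approximation AL spans
    [0, fs / 2**(levels+1)].
    """
    bands = {f"D{k}": (fs / 2 ** (k + 1), fs / 2 ** k)
             for k in range(1, levels + 1)}
    bands[f"A{levels}"] = (0.0, fs / 2 ** (levels + 1))
    return bands

# e.g. dwt_band_edges(128, 4)["D4"] == (4.0, 8.0)
```

For fs = 128 Hz and four levels, this reproduces the ranges from D1 (32–64 Hz) down to A4 (0–4 Hz).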

Feature Selection for Reducing the Dimensionality
An abundance of features will lead to over-specification, excessive computation load, and even overfitting when some features have little influence on the classification task. Achieving the smallest, most representative, distinguished, and robust feature subset that would yield the minimum generalization error is crucial.
Generally speaking, the feature selection procedure can be considered an optimization problem for data pre-processing and typically involves two processes. First, a subset search selects a candidate subset via a particular search strategy such as greedy forward, greedy backward, or bidirectional search: the forward search recursively appends relevant features to an initial set; conversely, the backward method continually discards irrelevant attributes from the complete feature set; the bidirectional approach combines the other two. Some randomized heuristic policies have also been used for selection and have shown efficiency. Second, a subset evaluation assesses the suitability of the candidate subset via various criteria.
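A minimal sketch of the greedy forward strategy described above, with a hypothetical `score_fn` standing in for whatever subset-evaluation criterion is used (a filter score or a wrapper's cross-validated accuracy):

```python
def greedy_forward_select(score_fn, n_features, k):
    """Greedy forward subset search: start empty and repeatedly append the
    single feature that most improves the evaluation score of the subset."""
    selected, remaining = [], set(range(n_features))
    for _ in range(k):
        best_f, best_s = None, float("-inf")
        for f in sorted(remaining):
            s = score_fn(selected + [f])
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy criterion: each feature contributes a fixed relevance weight.
weights = [0.1, 0.9, 0.5]
subset = greedy_forward_select(lambda S: sum(weights[f] for f in S), 3, 2)
# subset == [1, 2]: the two most relevant features, picked in relevance order
```

The backward variant would start from the full set and drop the least useful feature each round; only the loop direction changes.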
There are three general approaches to feature selection: filters, wrappers, and embedded methods; here, only the first two are considered. Wrapper-type methods judge the usefulness of a feature by the actual accuracy of a specific classifier, thus guaranteeing a specialized, non-redundant feature subset that best matches the learning method, yet they require extensive computation, so beforehand filtering is necessary. Filtering methods are model-independent and less computationally intensive, making them well suited to large datasets. The simplest baseline approach in the filtering category is removing features with low variance (RFLV), based on the idea that variables with milder undulations have less impact on the results. The idea is similar to the motivation of Principal Component Analysis (PCA), as well as the somewhat controversial Entropy Weight Method.
The Relief method inspects the problem from a correlation perspective and chooses the variables with a high association with the target. As with Linear Discriminant Analysis, the Fisher score assigns the highest score to features whose values for different classes are far from each other while values within the same class are close to one another. One common filtering practice based on mutual information is the Minimal-Redundancy-Maximal-Relevance (mRMR) framework, proposed in [5] to minimize the redundancy in feature sets obtained from the typical methods that rank genes according to some rule and pick the top-ranked ones. Mutual information itself can be used to measure the relevance or similarity of attributes.
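The Fisher score just described can be computed directly; the sketch below assumes a feature matrix `X` (samples × features) and integer class labels `y`:

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score per feature: between-class scatter of the feature's
    class means divided by the within-class scatter."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mean_all) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / within
```

A feature whose class-conditional means are well separated relative to the spread inside each class receives a high score and would be ranked first by the filter.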

Principal Component Analysis Compressed Representations
To examine the performance of manually extracted features, we used a simple and widely used dimension-reduction method, principal component analysis, to extract the principal components of the data as reference features in a linear and unsupervised fashion. PCA learns a linear transformation f(x) = W^T x + b of input data x ∈ R^(d_x), in which the columns of the d_x × d_h matrix W form an orthogonal basis for d_h orthogonal directions; the transformed result is d_h uncorrelated components. High-dimensional data can thus be compressed into a much smaller space while retaining most of the variation in the data set [25]. In many cases, a global representation is needed to lessen the computational burden or to visualize samples for better analysis. PCA identifies the principal components, or directions in the feature space along which the variance is maximal, by unsupervised combination of all the attributes. The so-called principal components are linear combinations of the original variables. Specifically, the first principal component is the direction along which the data points show the greatest variation; the second principal component is the direction uncorrelated with the first that maximizes the variance of the samples when projected onto it. If the data are standardized, the principal components are the normalized leading eigenvectors of the covariance matrix. In this piece, signals in each channel are represented by the top 25 components, whose cumulative contribution rate reaches nearly 100%, instead of 3840 variables, to reduce data volume and, in the meantime, minimize information loss. As a result, there are 800 dimensions for EEG and 200 dimensions for the periphery per trial.
However, the expressive power of linear features learned by PCA is rather limited. One cannot stack these linear features to form deeper representations since the composition of linear operations yields another linear operation.
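For reference, the compression just described can be sketched with a few lines of NumPy (eigendecomposition of the covariance of the centered data; in our setting `n_components` would be 25 per channel):

```python
import numpy as np

def pca_compress(X, n_components):
    """Project samples (rows of X) onto the top principal components.

    Components are the normalized leading eigenvectors of the covariance
    matrix of the centered data, ordered by explained variance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # descending variance
    W = eigvecs[:, order[:n_components]]
    explained_ratio = eigvals[order] / eigvals.sum()
    return Xc @ W, explained_ratio
```

The cumulative sum of `explained_ratio` over the retained components is the "cumulative contribution rate" used above to justify keeping 25 components per channel.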

Methods
The success of machine-learning algorithms in pattern recognition heavily relies on data representation, as different representations stand for different explanatory factors of variation behind the data [26]. Features have a decisive impact on experimental performance. Designing features based on specific domain knowledge is labor-intensive and contrary to the pursuit of artificial intelligence, and traditional machine-learning algorithms may inherit critical limitations from such features. In contrast, representation-learning algorithms such as probabilistic models, autoencoders, manifold learning, and deep networks can learn more powerful representations and disentangle the underlying explanatory factors, with performance expected to be comparable to that of humans; they are far less dependent on feature engineering. More abstract and more useful representations are yielded by composing multiple non-linear transformations. This paper focuses on the autoencoder model among the various ways of learning representations. This section presents the principles of our semi-supervised method for higher-level feature extraction at length. The workflow diagram of our proposed method is presented in Figure 8.

Stacked Denoising Autoencoder (SDA)
In this section, we will explain the theories and terminologies of autoencoders and denoising autoencoders. We will also explain why we chose to stack an array of denoising autoencoders to reduce the dimensionality of the physiological emotion data.

Traditional Autoencoder (AE)
A low-dimensional code can reconstruct high-dimensional input vectors in a small central layer of a neural network. Such networks are called "autoencoders" or "autoassociators" and can work much better than PCA if the initial weights are close to a good solution [27]. Autoencoders can be seen as a non-linear generalization of PCA. An example of the two-layer autoencoder is illustrated in Figure 9, in which the "encoder" network transforms the high-dimensional data into a low-dimensional code and a similar "decoder" network recovers the data from the code. They are trained together by minimizing the discrepancy between the original data and its reconstructed version using gradient descent.

Figure 9. Illustration of a classical two-layer autoencoder. The encoder transforms input data to the central-layer activations; the decoder maps from the feature space back into the input space, producing a reconstruction.

One theoretical shortcoming of traditional autoencoders is that they may fail to learn a useful representation when the number of units is not strictly decreasing from one layer to the next: under such a circumstance, a high-capacity autoencoder can simply learn an identity mapping (copying the input, which generally yields zero reconstruction error). Although Bengio's [28] experiments showed good generalization with non-decreasing layers, stacking traditional autoencoders achieved only comparable performance to DBN.

Denoising Autoencoder
The denoising autoencoder is a straightforward variant of the regular autoencoder and is trained to denoise a corrupted version of the input data. As depicted in Figure 10, compared to the original autoencoder, the denoising variant adds a polluting step, as expressed in Equation (23), thus making the input of the "encoder" the contaminated input data. Such a minor modification has been proven to significantly enhance autoencoder performance.
Here, we calculate the corrupted input x̃ and the reconstruction z as follows:

x̃ ∼ q(x̃ | x), h = f_θ(x̃) = s_f(W x̃ + b), z = g_θ′(h) = s_g(W′ h + b′), (23)

where f_θ and g_θ′ denote the closed-form parameterized encoder and decoder functions, respectively; θ = {W, b} and θ′ = {W′, b′} are the parameter vectors minimizing the reconstruction error; b (d′ × 1) and b′ (d × 1) are the biases; W (d′ × d) and W′ (d × d′) are the weight matrices; h is the computed feature vector; s_f and s_g denote the encoder and decoder activation functions. Typically, the elementwise sigmoid or hyperbolic tangent nonlinearity is chosen for s_f; if the decoder is linear, the identity function is set for s_g. An intuitive geometric interpretation [29] of denoising autoencoders is given by manifold learning, which supposes that high-dimensional data concentrate on a low-dimensional manifold.
Corrupted examples that are far from the manifold will be projected back to be near the manifold via the training procedure.
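A minimal sketch of the corrupted forward pass in the shape of Equation (23), using masking noise for the polluting step and sigmoid/identity activations for s_f and s_g (names and shapes here are illustrative, not the paper's implementation):

```python
import numpy as np

def dae_forward(x, W, b, W2, b2, corrupt_p=0.3, rng=None):
    """Forward pass of a denoising autoencoder: masking corruption,
    sigmoid encoder s_f, identity (linear) decoder s_g."""
    if rng is None:
        rng = np.random.default_rng(0)
    x_tilde = x * (rng.random(x.shape) > corrupt_p)   # polluting step
    h = 1.0 / (1.0 + np.exp(-(x_tilde @ W + b)))      # h = s_f(W x~ + b)
    z = h @ W2 + b2                                   # z = s_g(W' h + b')
    return x_tilde, h, z
```

Training would then minimize the discrepancy between `z` and the clean `x`, so the network must undo the corruption rather than copy its input.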

Stacked Denoising Autoencoder
Deep architectures can be much more efficient than shallow ones, as highly non-linear and highly varying functions can be compactly represented by deep multi-layer neural networks, as suggested by the complexity theory of circuits. The stacked denoising autoencoder borrows the greedy principle of DBNs: it is an original way of building deep networks by stacking layers of denoising autoencoders.
Deeper autoencoders can generate lower reconstruction errors than shallower ones [27]. Adding more layers, or stacking more one-hidden-layer autoencoders, increases the model's capacity, enabling the composition of more useful features. The denoising criterion that guides the purely unsupervised learning bridges the performance gap between the stacked autoencoder and the DBN, and even surpasses it in many cases [29]. Higher-level learned representations can boost the performance of subsequent classifiers.
The training procedure of stacked denoising autoencoders can be divided into two subprocesses: the unsupervised pre-training for searching for a proper initialization configuration for parameters in the network and the supervised fine-tuning for moderately adjusting the settings of the parameters. The process can be shown in Figure 8.
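The two-stage procedure can be illustrated with a toy greedy pre-training loop; the DAE layer below (tanh encoder, linear decoder, no biases, fixed hyperparameters) is a deliberately simplified stand-in for the full model, shown only to make the layer-by-layer structure concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_dae_layer(X, n_hidden, noise=0.3, lr=0.05, epochs=100):
    """Train one simplified DAE layer: corrupt the input, reconstruct the
    CLEAN input, and follow the gradient of the squared reconstruction error."""
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, (n_in, n_hidden))    # encoder weights
    V = rng.normal(0.0, 0.1, (n_hidden, n_in))    # decoder weights
    for _ in range(epochs):
        Xn = X * (rng.random(X.shape) > noise)    # masking corruption
        H = np.tanh(Xn @ W)
        err = H @ V - X                           # reconstruction error
        grad_V = H.T @ err / len(X)
        grad_H = err @ V.T * (1.0 - H ** 2)
        W -= lr * (Xn.T @ grad_H / len(X))
        V -= lr * grad_V
    return W

def greedy_pretrain(X, layer_sizes):
    """Greedy layer-wise pre-training: each new layer is trained as a DAE on
    the representation produced by the already-trained layers below it."""
    weights, H = [], X
    for n_hidden in layer_sizes:
        W = train_dae_layer(H, n_hidden)
        weights.append(W)
        H = np.tanh(H @ W)                        # feed representation upward
    return weights
```

In the full pipeline, the pre-trained encoder weights would then initialize the supervised network and be jointly fine-tuned by backpropagation on the labels.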

Unsupervised Pre-Training
A greedy layer-by-layer unsupervised "pre-training" is an algorithm to find the most appropriate initial configuration of weights and biases. This technique was first proposed by Hinton [30] and extended to cases where input is continuous by Bengio [28]. During the pre-training, layers are trained to represent the dominant factors of the data. Pre-training uses only the underlying knowledge behind the data, not the information contained in the labels, which will be applied in the fine-tuning stage to slightly adjust the weights found by pre-training.
It is believed to be capable of overcoming the challenges of deep learning (as stated in Section 3.1.1) by introducing prior knowledge to the fine-tuning procedure, where a similar argument is found in humans that their capacity to quickly become proficient in new tasks builds on much of what they have learned before facing that task. One reasonable and competing explanatory hypothesis is that the pre-training acts as an unusual regularizer [8]: pre-training would result in a regularization effect by establishing a particular initialization point of the parameter space of fine-tuning and restricting the parameters to be chosen in a relatively small, bounded, and local basin of attraction of the supervised cost function. This type of regularization has precedence in the neural networks in the form of "early stopping".
The training of ordinary deep architectures is a non-convex optimization problem. However, by initializing weights in a region near a local minimum, pre-training mitigates the complicated non-convex optimization, giving better generalization. Also, pre-training can facilitate the training process of deep networks by significantly reducing the total training time.

Global Fine-Tuning on a Supervised Criterion
The driving idea behind semi-supervised greedy layer-wise learning is to exploit information about the prior distribution of the input data to boost the generalization of the classifier. Training cases that belong to the same cluster will be mapped into nearby embeddings.
Pre-training provides a good initialization for supervised fine-tuning of the whole network. Fine-tuning is the second and last stage of training. The parameters of the entire network can be further optimized simultaneously by gradient descent with respect to a deterministically computable supervised criterion that depends on the representations [28]. It is worth mentioning that fine-tuning does not change the weights significantly, even after a large number of updates.

Deep-Belief Networks
A deep-belief network is another implementation of the autoencoder idea; pre-training and fine-tuning are still the main form of learning. The layer building block, the "Restricted Boltzmann Machine", is a bipartite graph modeled by an ensemble of binary vectors, where Gaussian or binary "visible" units are connected to stochastic, binary "hidden" feature detectors through symmetrically weighted connections. Restricted Boltzmann Machines (RBM) are a special case of Boltzmann Machines and of Markov Random Fields. The joint probability of the various values of the units in the visible and hidden layers of the RBM is defined by the energy:

P(v, h) = exp(−E(v, h)) / Z,

where Z is the partition function; v and h represent the states of the units in the visible layer and the hidden layer, respectively. If the visible units are binomial, the energy is calculated as:

E(v, h) = −Σ_i b_i v_i − Σ_j b_j h_j − Σ_{i,j} v_i h_j W_ij,

where v_i and h_j are the states of visible neuron i and feature j; b_i and b_j are their biases, and W_ij is the weight between them. Note that when the input-layer units take real values, the energy takes a different form. RBMs are stacked greedily, where the first layer of feature detectors becomes the visible layer for training the next RBM. Classical autoencoders and RBMs are similar in their functional form despite the enormous difference in interpretation and training algorithms. After unsupervised training, an output logistic-regression layer is added to estimate the class probabilities using multinomial logistic regression (soft-max).
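The energy and joint probability above can be evaluated directly for a toy binary RBM (brute-force partition function, so only feasible for tiny models):

```python
import itertools
import numpy as np

def rbm_energy(v, h, b_v, b_h, W):
    """E(v, h) = -sum_i b_i v_i - sum_j b_j h_j - sum_ij v_i h_j W_ij."""
    return -(b_v @ v) - (b_h @ h) - v @ W @ h

def rbm_joint_prob(v, h, b_v, b_h, W):
    """P(v, h) = exp(-E(v, h)) / Z, with the partition function Z summed by
    brute force over all binary configurations (tiny models only)."""
    n_v, n_h = len(b_v), len(b_h)
    Z = sum(np.exp(-rbm_energy(np.array(vv), np.array(hh), b_v, b_h, W))
            for vv in itertools.product([0, 1], repeat=n_v)
            for hh in itertools.product([0, 1], repeat=n_h))
    return np.exp(-rbm_energy(v, h, b_v, b_h, W)) / Z
```

Real RBM training avoids computing Z; contrastive-divergence updates only need conditional samples, which is what makes stacking them into a DBN tractable.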

Performance and Evaluation
To investigate and compare the performance of our proposed method with existing mainstream methods, which can be summarized as combining one or more human-designed features with one classical learning algorithm (traditional machine-learning methods or deep-learning approaches), we performed substantial experiments. The experimental setup and the results are elaborated below. All data preparations are detailed in Figure 11.

Hand-Engineered Features
As exhibited in Figure 11, the number of derived features was huge, which may mean that some features are redundant or irrelevant. Hence, before proceeding any further, we performed a hybrid filtering selection as a preliminary operation to reduce the dimensionality. The Scikit-learn Python package [31] was used for implementing the machine-learning models.
First, as a pre-step, we utilized the most fundamental RFLV to eliminate a small number of less volatile features. The threshold was set to 0.01. As a result, a total of 148 EEG features with variances lower than 0.01 were picked out, among which were 18 from the higher-order spectra, 39 from the band power, 21 from the Hilbert-Huang spectrum, 24 from the differential asymmetry, 44 from the rational asymmetry, and 1 from the discrete wavelet transform. As for the periphery modality, 31 features were removed, including 20 from the Hilbert-Huang spectrum, 3 from the discrete wavelet transform, 3 from the Hjorth features, and 2 from the statistics in the time domain. The results can vary because the threshold is a hyperparameter of the unsupervised RFLV. Notably, the higher-order spectra and Hilbert-Huang spectrum failed to meet the expectations claimed in the literature.
Then, after eliminating features with small variation, we applied two supervised filtering methods and took the union of their choices as the input to the learning algorithms with model-dependent wrapper selection. Our motivation was that, as a rule, incorporating several state-of-the-art feature selection methods surpasses any single method in performance, providing broader coverage of the original space spanned by the entire dataset while emphasizing discriminative efficiency. To decide which filtering methods to combine, we compared the performance of ten univariate, supervised filtering feature selection methods on recognizing the valence dimension.
The result is presented in Table 4. The chi-square method is marked (bold or underlined) by four models and wins first place in two of them. The method based on mutual information follows closely behind, being marked three times and winning first place twice. Accordingly, filtering based on chi-square and mutual information is incorporated to conduct further filtering after RFLV: the union of the top-ranked 200 features of both methods forms the final feature set for wrapper selection and classification. Specifically, in doing so, we arrive at 373 EEG components and 216 periphery dimensions for valence, 377 and 215 for arousal, and 365 and 215 for dominance, as shown in Figure 11.

Figure 11. Raw data in DEAP is refined by a set of domain-specific algorithms, resulting in 1896 EEG features and 248 periphery features, which are further filtered by hybrid feature selection and taken as input for the machine-learning models and the DNN/CNN model. To examine the expressive ability of hand-engineered representations, we reduce the raw data to its principal components using the widely employed dimensionality-reduction method PCA. The raw data were also augmented, normalized, and fed into the SDA model to apply our semi-supervised feature-learning approach. The DBN model is taken as a reference in the meantime.
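The final filtering step, taking the union of the top-200 features under the chi-square and mutual-information scores, amounts to the following sketch (scores passed in as plain arrays):

```python
import numpy as np

def union_top_k(chi2_scores, mi_scores, k=200):
    """Union of the k top-ranked feature indices under two filter criteria."""
    top_chi2 = set(np.argsort(chi2_scores)[::-1][:k])
    top_mi = set(np.argsort(mi_scores)[::-1][:k])
    return sorted(top_chi2 | top_mi)
```

Because the two criteria only partially agree, the union is larger than 200, which matches the 373/377/365 EEG feature counts reported above for the three scales.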

Table 4. Comparison results of ten notable filtering-type feature selection schemes on several models for the valence scale. Accuracies of the top-two methods for each learning algorithm are shown in bold, and the best one for each model is additionally underlined.

To investigate the performance of hand-engineered features and evaluate the contribution of both modalities, we conducted three binary classification tasks using seven machine-learning models widely used in the literature. As mentioned above, a hybrid strategy of filter- and wrapper-type methods was employed for feature selection, and the final feature subsets for the three recognition tasks were thus acquired.

Valence
Both cross-subject and single-subject classifications are performed. For cross-subject recognition, the trials of all 32 participants are mixed, resulting in 1280 pieces of data. The results in Table 5 were obtained through five-fold cross-validation with the greedy forward wrapper search. For single-subject recognition, the 40 trials of one subject formed a tiny individual database. To refrain from the overfitting that small data sets are prone to, we performed leave-one-out cross-validation for each subject, which can be viewed as 40-fold ordinary cross-validation with only one piece of test data; the accuracy was calculated by dividing the total number of correct tests by 40. In addition, a single feature can occasionally distinguish the dimensional classes perfectly, yielding 1.0 accuracy by itself; this was due to the small datasets severely lacking training cases, and we skipped such features in the experiment.

Table 5. Accuracies of seven widely used conventional machine-learning models for cross-subject and single-subject recognition on three scales of valence, arousal, and dominance. The best result is shown in bold for each trial, and the more prominent one between the EEG and periphery modalities is underlined. Here, w stands for weight.

To obtain a better result, decision fusion with equal weights and with optimal weights was also investigated. The optimal weights were found by exhaustive search with a step length of 0.01. The fused output was the weighted average of the two sets of output probabilities from the EEG and periphery modalities. As Table 5 and Figure 12 show, decision fusion with optimal weights outperformed the others, whereas fusion with equal weights showed the poorest performance. In this case, decision fusion using a weighted average only changed the recognition results that the two modalities disagreed on, while optimal weights tended to follow the opinion of the correct modality.
Figure 12. Accuracy of seven traditional learning algorithms for cross-subject and single-subject recognition on the three scales. Cross-subject classification tasks are shown in solid lines; single-subject ones in dashed lines. Results using only the EEG modality are in red; green for the periphery modality alone; blue for decision fusion with equal weights; and brown for decision fusion with the optimal weights found by exhaustive search.
In contrast, fusion with equal weights showed no preference, thus making more mistakes. Besides, distinct from the usual declarations in the literature that the periphery is inferior to EEG and plays a supporting role, we observed that the periphery modality drew equally with EEG for the valence scale in both cross-subject and single-subject recognition. For arousal and dominance, although, for the most part, the EEG outperformed the periphery modality, there were times that the periphery matched or beat EEG.
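The weighted-average decision fusion with exhaustively searched weights can be sketched as follows, assuming two-column probability outputs per modality and binary labels:

```python
import numpy as np

def fuse_decisions(p_eeg, p_periph, y, step=0.01):
    """Exhaustive search for the decision-fusion weight: the fused output is
    w * p_eeg + (1 - w) * p_periph, and w is swept over [0, 1] in fixed
    steps to maximize accuracy on the given labels."""
    best_w, best_acc = 0.0, -1.0
    for w in np.arange(0.0, 1.0 + step, step):
        fused = w * p_eeg + (1 - w) * p_periph
        acc = np.mean((fused[:, 1] > fused[:, 0]) == y)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

Equal-weight fusion corresponds to fixing w = 0.5, which is why it can only change predictions on samples where the two modalities disagree.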
In addition, the times that various hand-engineered features were selected by the machine-learning algorithms for both cross-subject and single-subject recognition on the valence, arousal, and dominance dimensions were added up, as shown in Figure 13. Features derived from the discrete wavelet transform in the time-frequency domain were the most favored, followed by the power of the four frequency bands in the frequency domain and the mobility and complexity defined by Hjorth in the time domain. Most features in the time domain were running neck-and-neck; however, surprisingly, the Hilbert-Huang spectrum and electrode-pair features were not as popular as expected.

Using Deep Neural Networks to Compare with PCA-Compressed Features
To further verify the effectiveness of the well-designed features elaborated in Section 2.3, we use the oldest feature-extraction method, PCA, as a reference.
We compared the performance of two feature sources: the hand-crafted features introduced in Section 2.3 and the PCA-compressed features mentioned in Section 2.4. Note that only cross-subject recognition was performed using the deep architectures; single-subject recognition was conducted only in Section 4.1.1. In this experiment, all 1748 features for EEG and 216 for the periphery modality were used without further filtering, except with RFLV. For PCA, the top 25 principal components per channel were retained, so there were 800 components for EEG and 200 for the periphery. Feature fusion by concatenating the two feature vectors was also implemented. A three-hidden-layer deep neural network was used, and the dropout probabilities were set to 0.45, 0.5, and 0.25, respectively. Specifically, we employed a 1748-800-200-20-2 network for hand-engineered EEG, a 217-100-50-10-2 network for the hand-engineered periphery, and a 1965-500-100-20-2 network for the fusion of hand-crafted features. The numbers of neurons in the three hidden layers for the PCA features were 500-100-20, 100-100-20, and 500-100-20 for the three feature sources. The dataset was divided into a training set of size 768, a validation set for adjusting hyperparameters such as the learning rate and the number of training epochs, and a testing set of size 256. Five-fold cross-validation was also used here.
Our loss function consisted of four terms to avoid overfitting and to reduce the impact of category imbalance:

$$\mathcal{L}(\theta) = \lambda_0 \cdot E_{\mathrm{CE}} + \lambda_1 \cdot L_1 + \lambda_2 \cdot L_2 + \lambda_3 \cdot C \qquad (29)$$

where $\lambda_0, \lambda_1, \lambda_2, \lambda_3$ are the coefficients of the four terms; $E_{\mathrm{CE}}$ is the cross-entropy error; $L_1$ and $L_2$ denote the L1 regularization loss and the L2 regularization loss, respectively; and $C = C_{01} + C_{10}$ represents the cost of misclassification caused by class imbalance. The four coefficients and other hyperparameters are adjusted according to the cross-entropy error.
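The four-term loss described above can be sketched in numpy as follows. The function names, the binary form of the cross-entropy, and the way the imbalance cost is applied per error type are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def total_loss(probs, labels, weights, lambdas, costs):
    """Four-term loss: weighted cross-entropy + L1 + L2 + imbalance cost.

    probs   : predicted probabilities of the positive class
    labels  : binary ground-truth labels (0/1)
    weights : list of weight arrays to regularize
    lambdas : (lam0, lam1, lam2, lam3) term coefficients
    costs   : (C01, C10) misclassification costs for the two error types
    """
    lam0, lam1, lam2, lam3 = lambdas
    c01, c10 = costs
    eps = 1e-12
    # Binary cross-entropy error over the batch
    ce = -np.mean(labels * np.log(probs + eps)
                  + (1 - labels) * np.log(1 - probs + eps))
    l1 = sum(np.abs(w).sum() for w in weights)   # L1 regularization loss
    l2 = sum((w ** 2).sum() for w in weights)    # L2 regularization loss
    # Imbalance cost: each error type weighted by its own cost
    c = c01 * np.mean((probs >= 0.5) & (labels == 0)) \
        + c10 * np.mean((probs < 0.5) & (labels == 1))
    return lam0 * ce + lam1 * l1 + lam2 * l2 + lam3 * c

probs = np.array([0.9, 0.2, 0.8, 0.1])
labels = np.array([1.0, 0.0, 1.0, 0.0])
weights = [np.array([0.5, -0.3])]
val = total_loss(probs, labels, weights, (1.0, 0.01, 0.01, 1.0), (1.0, 2.0))
```

With perfect predictions and no weights, every term vanishes, so the loss is (numerically) zero.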
The results are shown in Table 6. Hand-engineered features outperform the unsupervised PCA-compressed ones in all comparisons, demonstrating the serviceability and superiority of these human-designed representations.


Automatically Generated High-Level Representations
This section compares the expressive power of several deep-learning models on an RTX 2060 GPU, including two discriminative models (DNN, CNN) and two generative models (DBN, SDA). The augmented data were utilized as input after normalization.

Supervised Discriminative Architectures
For discriminative models, we focused on deep neural networks (DNNs) and convolutional neural networks (CNNs), which have attracted the most attention with the popularity of deep learning. These networks take the original data directly as input, so each layer can act as a special feature extractor without human intervention, similar to the idea of representation learning in generative models. However, generative models can learn more parameters without overfitting, requiring no feedback from the labels: for generative models, the parameters are constrained by the number of bits required to describe the input for each training case, whereas for discriminative models they are constrained only by the number of bits needed to describe the label [29]. We ignored the internal differences between the two networks and used them simply as examples of discriminative models whose feature-extraction performance could be compared against the generative models.
A 3840-800-200-20-2 network was designed for the DNN, with dropout probabilities of 0.45, 0.5, and 0.25 for the three hidden layers. Each affine layer was followed by a batch-normalization layer and a ReLU activation layer. As illustrated in Figure 14a, the CNN model deployed in the experiment consists of four modules: the first two contain a convolutional layer followed by batch normalization and ReLU, while the last two add a max-pooling layer. The kernel size was set to three, the stride to two, and no padding was applied. A dropout probability of 0.85 was set for the fully connected layer. Each piece of data was converted to a 60 × 64 image.
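As a quick check of the CNN geometry above, the spatial sizes after the convolutional layers follow the standard formula floor((n + 2p - k)/s) + 1. The sketch below traces only the first two convolutional modules on the 60 × 64 input (kernel 3, stride 2, no padding); the pooling parameters of the later modules are not specified in the text, so they are omitted:

```python
def conv_out(size, kernel=3, stride=2, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Trace the first two conv modules on the 60 x 64 input image
h, w = 60, 64
for _ in range(2):
    h, w = conv_out(h), conv_out(w)
```

After the first module the feature map is 29 × 31, and after the second it is 14 × 15.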

Semi-Supervised Generative Architectures
The basic concepts of the stacked denoising autoencoder (SDA) and the deep belief network (DBN) are similar: the data are first transformed into a new representation by unsupervised pre-training, and a supervised classifier is layered on top to map the representations into class predictions. Indeed, both minimizing the reconstruction error for autoencoders and contrastive-divergence training for RBMs can be seen as approximations of a log-likelihood gradient. However, stacking ordinary autoencoders regularly underperforms stacking restricted Boltzmann machines; our method of stacking denoising autoencoders exhibited superior performance in this section.

Considering that our data were real-valued, we used a linear decoder with a squared reconstruction error for the SDA, whose structure is shown in Figure 14b. To apply noise to the data, we used the standard "dropout" method, i.e., resetting a fixed random proportion of the original data elements to zero; the proportion was set to 0.2. The fine-tuning output layer takes input only from the top hidden layer. Once the stack of autoencoders has been trained and the stack of encoders has thus been built, the highest-level output representations are used as input for a stand-alone supervised learning algorithm. A K-nearest neighbor model and a support vector machine with the radial basis function kernel were implemented as classifiers.
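The corruption step and the squared reconstruction error can be sketched as follows. The input dimensionality matches the 3840-unit input layer used later in this section; the batch size and random seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def corrupt(x, level=0.2, rng=rng):
    """Dropout-style corruption for denoising training: reset a random
    fraction `level` of the input elements to zero (sketch, not the
    paper's exact code)."""
    mask = rng.random(x.shape) >= level
    return x * mask

def reconstruction_error(x, x_hat):
    """Average squared reconstruction error of a (linear) decoder output."""
    return np.mean((x - x_hat) ** 2)

x = rng.normal(size=(16, 3840))        # a batch of real-valued inputs
x_tilde = corrupt(x, level=0.2)        # corrupted copy fed to the encoder
```

During pre-training, the encoder sees `x_tilde` but the reconstruction error is measured against the clean `x`, which is what forces the learned representation to be robust to the corruption.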
To further investigate the mechanism of action of the SDA, we conducted several further experiments to verify the effects of pre-training, the stacking depth, and the level of added noise. A 3840-1000-200-40-2 autoencoder was employed. The results are shown in Figure 15.
It should be pointed out that, with or without the unsupervised pre-training, the supervised optimization objective and the gradient-based scheme are the same; only the starting point in the parameter space differs, being either randomly selected or obtained by pre-training. The average squared reconstruction error per test datum during fine-tuning is shown in Figure 15a. The pre-trained autoencoder has a much lower initial error than the one without pre-training. Although both reduce the error to near zero in our task, pre-training makes progress faster and thus requires less fine-tuning.
The average squared reconstruction error on the test set during fine-tuning is shown in Figure 15b. The error of the deep 3840-1000-200-40-2 autoencoder was initially greater than that of the shallower 3840-1000-200-2 autoencoder, but it decreased more rapidly. However, we observed that the hypothesis "the deeper the network, the better the performance" does not hold. An appropriate increase in network depth can bring benefits, while an overly large depth seemed to increase the probability of finding poor local minima, resulting in bad performance. A deep 3840-2000-1000-400-100-20-2 autoencoder always got stuck in poor solutions, whereas the shallower 3840-1000-200-40-2 was hardly ever trapped.
The average squared reconstruction error at different levels of corruption during fine-tuning on the valence scale, using the EEG modality only, is shown in Figure 15c. A 3840-1000-200-40-2 autoencoder was employed with 200 epochs of pre-training. The larger the noise level, the more prominent the effect of pre-training and the less fine-tuning required.
The results of the four deep architectures are shown in Table 7 and Figure 16. The two widely used discriminative networks and the DBN showed similar performances and were slightly outperformed by the two SDA models. Among the SDA variants, the SVM classifier shows a slight advantage (shown in bold).

Figure 16. Accuracies of the five deep-learning approaches using the augmented raw data as input on the three scales of the dimensional affect model. Results of pure EEG and pure periphery are in red and green, respectively; data fusion of both modalities is in blue.


Conclusions and Future Work
In this paper, we explored learning a mapping from the input to a novel representation. To circumvent the labor of artificially designed features and leverage the power of deep-learning instruments, we proposed to acquire affective and robust representations automatically through the SDA architecture with unsupervised pre-training, followed by supervised fine-tuning using backpropagation of error derivatives. The performances of the different features and models were compared through three binary classification tasks based on the renowned Valence-Arousal-Dominance affection model. We conducted the following experiments:

1. Used EEG and periphery signals to generate hand-engineered features.
(a) We performed decision fusion using traditional machine-learning algorithms with hybrid feature selection;
(b) We performed feature fusion and used deep neural networks to compare hand-engineered features with PCA-compressed features.
2. Used augmented data as input after normalization.
(a) We fed the augmented data directly into the SDA (feature extractor) after proper pre-processing and then combined the generated features with the classifier for network fine-tuning. To compare the practical effects of our method, we took two discriminative deep architectures (DNN and CNN) and another generative stacked semi-supervised architecture (DBN) as references.
The results of these experiments are as follows: (1) The fusion data achieve performance better than, or between, the two single modalities. Accuracies of single-subject recognition are about 20 percent higher than those of cross-subject classification using well-designed hand-engineered features, demonstrating the "highly personal" nature of emotion. In addition, statistics from the hybrid feature selection showed that features derived from the wavelet transform and the band power were the most preferred in emotion-recognition tasks. Moreover, the feature fusion of hand-crafted features achieved 0.651 in valence; (2) Our scheme (using the SVM model), with an accuracy of 0.652 in valence (data fusion), slightly outperformed the other three deep feature extractors and also surpassed the state-of-the-art of hand-engineered features.
Due to the diversity and complexity of human emotions, many problems remain in recognizing them correctly by machine. The main limitations of our work were as follows:
1. The resulting accuracy was not high, as the model was relatively simple. We need to construct and optimize deep-learning models and conduct further model fusion to improve the results.
2. The system needs to select more effective EEG signal features.