Article

Structured and Factorized Multi-Modal Representation Learning for Physiological Affective State and Music Preference Inference

1 College of Arts Management, Shandong University of Arts, Jinan 250300, China
2 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Symmetry 2026, 18(3), 488; https://doi.org/10.3390/sym18030488
Submission received: 9 February 2026 / Revised: 5 March 2026 / Accepted: 7 March 2026 / Published: 12 March 2026
(This article belongs to the Section Computer)

Abstract

Emotions and affective responses are core intervention targets in music therapy. Through acoustic elements, music can evoke emotional responses at physiological and neurological levels, influencing cognition and behavior while providing an important dimension for evaluating therapeutic efficacy. However, emotions are inherently abstract and difficult to represent directly. Artificial intelligence models therefore provide a promising tool for modeling and quantifying such abstract affective states from physiological signals. In this paper, we propose a structured and explicitly factorized multi-modal representation learning framework for joint affective state and preference inference. Instead of entangling heterogeneous dynamics within monolithic encoders, the framework decomposes representation learning into cross-channel interaction modeling and intra-channel temporal–spectral organization modeling. The framework integrates electroencephalography (EEG), peripheral physiological signals (GSR, BVP, EMG, respiration, and temperature), and eye-movement data (EOG) within a unified temporal modeling paradigm. At its core, a Dynamic Token Feature Extractor (DTFE) transforms raw time series into compact token representations and explicitly factorizes representation learning into (i) explicit channel-wise cross-series interaction modeling and (ii) temporal–spectral refinement via learnable frequency-domain gating. These complementary structural modules are implemented through Cross-Series Intersection (CSI) and Intra-Series Intersection (ISI), which perform low-rank channel dependency learning and adaptive spectral modulation, respectively. A hierarchical cross-modal fusion strategy integrates modality-level tokens in a representation-consistent and interaction-aware manner, enabling coordinated modeling of neural, autonomic, and attentional responses. The entire framework is optimized under a unified multi-task objective for valence, arousal, and liking prediction. Experiments on the DEAP dataset demonstrate consistent improvements over state-of-the-art methods. The model achieves 98.32% and 98.45% accuracy for valence and arousal prediction, 97.96% for quadrant classification in single-task evaluation, and 92.8%, 91.8%, and 93.6% accuracy for valence, arousal, and liking in joint multi-task settings. Overall, this work establishes a structure-aware and factorized multi-modal representation learning framework for robust affective decoding and intelligent music therapy systems.

1. Introduction

Music is a powerful medium capable of eliciting rich emotional experiences and modulating human affective states. In clinical and therapeutic contexts, music has been widely adopted as a non-invasive intervention for emotional regulation, cognitive rehabilitation, and mental health treatment, particularly for conditions such as depression, anxiety, and stress-related disorders. Beyond therapy, understanding emotional responses to music also plays a crucial role in personalized recommendation systems, human–computer interaction, and affect-aware intelligent systems. Accurately decoding affective states and personal preferences from physiological responses has therefore become a fundamental problem in affective computing and computational neurophysiology [1,2].

Emotions induced by music are inherently complex and multidimensional. Rather than being expressed through a single observable channel, affective responses emerge from the coordinated activity of multiple physiological and neurological systems. Electroencephalography (EEG) reflects cortical dynamics and neural oscillatory patterns associated with emotional perception and cognitive appraisal, while peripheral physiological signals such as galvanic skin response (GSR), blood volume pulse (BVP), electromyography (EMG), respiration, and skin temperature capture autonomic nervous system responses related to arousal, stress, and valence. Eye-movement signals (EOG) further provide valuable cues about attention allocation and cognitive engagement. These heterogeneous modalities jointly encode complementary aspects of emotional experience, making multi-modal physiological analysis a promising yet challenging research direction.

Despite substantial progress, existing emotion recognition approaches still exhibit several limitations. Early studies primarily relied on handcrafted features and shallow classifiers, often focusing on single modalities or limited signal subsets [3,4]. More recent deep learning-based approaches improve representation capacity but frequently adopt coarse-grained emotion formulations, such as binary classification or low-cardinality categorical schemes (e.g., valence–arousal quadrants) [5,6]. While effective in simplified settings, such formulations fail to capture the fine-grained variability of emotional intensity and subjective preference, which is particularly critical in music therapy scenarios.

Another important limitation lies in task formulation and representation structure. Most prior works treat affective dimensions such as valence, arousal, or preference as independent prediction problems and optimize them in isolation [7,8]. However, extensive psychological and neuroscientific evidence suggests that these dimensions are strongly interrelated. Ignoring such interdependencies not only limits expressive power but also undermines generalization under subject variability and signal perturbations.

From a representation learning perspective, multi-modal physiological affective inference can be formulated as a structured multi-level information integration problem. Effective decoding requires explicitly modeling cross-channel interaction patterns, capturing stable temporal–spectral organization within signals, and preserving coherent relationships across correlated affective dimensions. However, many existing architectures implicitly entangle these factors within unified encoders, resulting in limited structural interpretability and suboptimal representation consistency.
Nevertheless, most existing multi-modal models rely on straightforward feature concatenation or shallow fusion strategies, which are insufficient to capture complex cross-modal dependencies among neural, autonomic, and attentional signals [9,10]. To address these challenges, we propose a unified multi-task framework for fine-grained emotion and preference recognition from multi-modal physiological signals, with a particular focus on music therapy applications. Our framework jointly processes EEG, peripheral physiological signals (GSR, BVP, EMG, respiration, and temperature), and eye-movement data (EOG), enabling comprehensive modeling of affective responses across cortical, autonomic, and behavioral domains. Unlike previous approaches that predominantly employ binary or coarse-grained classification schemes [11,12], our method supports fine-grained 9-class prediction for each affective dimension, including valence, arousal, and liking. At the core of the proposed framework lies a Dynamic Token Feature Extractor (DTFE), which transforms raw physiological time series into compact and discriminative token representations. DTFE explicitly decomposes representation learning into two complementary structural modeling components: (i) cross-series interaction modeling implemented through structured channel-wise attention mechanisms, and (ii) intra-series temporal–spectral modeling achieved via adaptive frequency-domain refinement. This structured decomposition is instantiated through Cross-Series Intersection (CSI) and Intra-Series Intersection (ISI), enabling explicit, modular, and interpretable modeling of inter-channel dependencies and intra-channel temporal dynamics. On top of modality-specific processing, we introduce a hierarchical cross-modal fusion mechanism that integrates modality-level representations in a representation-consistent and interaction-aware manner, thereby facilitating synergistic affective reasoning across heterogeneous modalities. The entire framework is trained end-to-end under a unified multi-task learning paradigm, allowing shared structural representations to be jointly optimized across related affective objectives. Building upon this structured and factorized perspective of multi-modal physiological modeling, the main contributions of this work are summarized as follows:
  • Structured multi-task affective modeling paradigm: We propose a unified modeling framework that jointly infers emotional valence, arousal, and music liking from physiological signals. By explicitly modeling inter-task relationships, this formulation promotes representation consistency across correlated affective dimensions and improves robustness compared with conventional single-task approaches.
  • Structured token-based representation learning: We introduce a token-based representation learning strategy in which affective representations are explicitly factorized into cross-series interaction modeling and intra-series temporal–spectral modeling components. This decomposition, realized via CSI and ISI, enables interpretable, modular, and structurally coherent physiological modeling.
  • Hierarchical cross-modal integration mechanism: We propose a hierarchical fusion mechanism that integrates neural, autonomic, and attentional modalities in a structure-aware and representation-consistent manner, preserving modality-level regularities while enabling adaptive cross-modal interaction.
  • Strong empirical performance and generalization: Extensive experiments on the DEAP dataset demonstrate that the proposed structured modeling framework achieves competitive performance under both single-task and multi-task learning settings.
By advancing structured and factorized representation modeling for physiological signals, this work addresses fundamental limitations of prior affective computing methods and establishes a principled foundation for robust affective inference, intelligent music therapy, and affect-aware human–computer interaction systems.

2. Related Work

2.1. Emotion Recognition Based on EEG Signals

Electroencephalography (EEG) is one of the most widely used modalities for emotion recognition due to its high temporal resolution and direct correlation with neural activity. Existing EEG-based emotion recognition methods can be broadly categorized into handcrafted feature approaches and deep representation learning models.
Early studies predominantly relied on handcrafted features extracted from EEG signals, such as frequency-domain power spectra, time-domain statistics, and entropy-based descriptors, followed by conventional classifiers for binary or multi-class emotion recognition [1,2,13]. These approaches are computationally efficient and interpretable, but their performance is highly dependent on domain-specific feature engineering and tends to degrade under cross-subject variability.
To overcome these limitations, more sophisticated learning-based models have been proposed. Hierarchical neural architectures [1], graph-regularized sparse models [14], and deep forest frameworks [15] were introduced to capture spatial and temporal dependencies among EEG channels. More recent advances further incorporate attention mechanisms to enhance representation capacity. For example, Song et al. [16] proposed a dynamical graph convolutional neural network to model inter-channel relationships, while Zhang et al. [17] introduced hierarchical self-attention to localize emotion-relevant temporal segments within EEG signals. Recently, Ding et al. [18] proposed EmT, a transformer-based framework designed for generalized cross-subject EEG emotion recognition, leveraging global attention to enhance subject-invariant representation learning. Similarly, MASA-TCN [19] introduces multi-anchor space-aware temporal convolutions to jointly model spatial topology and temporal dynamics, demonstrating strong performance on both continuous and discrete emotion settings. Xiao et al. [20] further developed AT-DGNN, which integrates graph-based learning and dynamic channel interaction mechanisms to capture music-driven and inter-channel emotional dependencies.
Models such as GLFANet [21] and DAST [22] additionally integrate spatio-temporal attention and domain adaptation strategies to improve generalization across subjects.
Beyond standard emotion recognition settings, EEG-based models have also been explored in special populations, such as patients with disorders of consciousness [23], as well as in real-time brain–computer interface (BCI) applications [24]. While these studies demonstrate the expressive power of EEG for affective analysis, most existing approaches implicitly entangle inter-channel interactions and intra-channel temporal dynamics within monolithic representations, making them sensitive to individual differences and EEG non-stationarity.

2.2. Emotion Recognition by Merging Multiple Physiological Signals

To address the inherent limitations of single-modality systems, an increasing body of work has investigated multimodal emotion recognition by integrating EEG with peripheral physiological signals, such as galvanic skin response (GSR), blood volume pulse (BVP), electromyography (EMG), respiration, and temperature. These approaches aim to exploit the complementary characteristics of cortical, autonomic, and behavioral responses to improve robustness and recognition accuracy.
Early multimodal methods primarily adopted ensemble learning or feature-level fusion. Representative examples include IRS [3] and MESAE [6], which combine ECG, GSR, EMG, and other signals using ensemble or attention-based architectures. More recent studies leverage deep learning to perform joint representation learning across modalities. Liu et al. [9] employed deep canonical correlation analysis (DCCA) to learn shared embeddings from EEG, GSR, and eye-movement signals, while Tang et al. [11] proposed a hierarchical fusion framework with contrastive alignment to enhance cross-modal consistency. Other approaches, such as BDAE [5], MM-ResLSTM [7], and FGSVM [8], utilize deep residual networks, recurrent encoders, or kernel-based fusion strategies to jointly model neural and peripheral physiological modalities.
Despite these advances, most multimodal emotion recognition models are formulated as single-task classification pipelines with limited output granularity, typically focusing on binary or low-cardinality emotion categories. Only a few methods, such as DEMA [12], support more fine-grained or multi-label emotion prediction. However, even these approaches primarily optimize individual affective dimensions in isolation and do not explicitly model the hierarchical structure or interdependencies among multiple affective attributes.

2.3. Summary and Motivation

In summary, existing studies on physiological emotion recognition—whether based on EEG alone or on multimodal signal fusion—have demonstrated the feasibility of decoding affective states from neural and peripheral responses. However, from a representation learning perspective, most prior approaches remain largely task-driven and data-driven, with limited emphasis on the intrinsic structural regularities and explicit interaction dependencies underlying physiological affective responses. As summarized in Table 1, the majority of existing methods are constrained to single-task settings and coarse-grained label spaces, such as binary or 4/5-class emotion schemes. It is important to note that these reduced-class formulations are typically designed under specific evaluation protocols and dataset configurations, and therefore are not directly comparable to fine-grained ordinal 9-class settings. Accordingly, Table 1 is intended to highlight differences in task formulation and modeling paradigm rather than to provide direct performance comparisons across heterogeneous evaluation settings. Such formulations restrict both the expressive capacity and structural interpretability of affective representations in complex scenarios such as music therapy. More critically, few prior models are designed to jointly handle multiple interdependent affective dimensions—such as valence, arousal, and liking—within a unified framework that explicitly models their inter-task relationships and shared representational structure. From a structural modeling perspective, many existing deep learning and attention-based approaches implicitly entangle cross-channel interactions and intra-channel temporal–spectral dynamics within unified encoders, without clearly decomposing these factors into complementary modeling components. As a result, important structural patterns—such as recurrent temporal dynamics within individual channels and coordinated interaction patterns across channels—are often captured implicitly rather than through deliberate architectural design. This implicit entanglement limits modular interpretability and weakens representation consistency across modalities and tasks, particularly under subject variability and multimodal heterogeneity. In contrast, the proposed method is explicitly formulated from a structured and factorized representation learning perspective. It supports fine-grained 9-class prediction for each affective dimension within a unified multi-task framework, while promoting representation consistency across correlated affective dimensions and heterogeneous modalities. By jointly modeling EEG, GSR, BVP, EMG, respiration, temperature, and EOG signals, and by explicitly decoupling cross-series interaction modeling from intra-series temporal–spectral modeling, the proposed framework addresses the representational bottlenecks of prior work. This structured decomposition enables high-resolution, multi-dimensional emotion and preference estimation with improved interpretability, stability, and generalization performance.

3. Methodology

We propose a structured and factorized multi-modal physiological modeling framework for recognizing emotional valence and musical preference during music therapy based on physiological responses. As illustrated in Figure 1, the overall system consists of four stages: (1) music-based stimulus presentation, (2) multi-channel physiological signal acquisition, (3) structured feature extraction via Dynamic Token Feature Extractors (DTFEs), and (4) hierarchical cross-modal fusion for joint affective inference.

During the stimulus phase, participants are exposed to diverse music videos designed to evoke varying emotional states and stylistic preferences, while physiological signals are synchronously recorded. The collected signals include 32-channel electroencephalography (EEG), 6-channel peripheral physiological measurements (e.g., GSR, BVP, EMG, respiration, and skin temperature), and 2-channel electrooculography (EOG), providing complementary observations of cortical, autonomic, and attentional responses.

After preprocessing, each modality is processed by a modality-specific DTFE module. EEG, peripheral, and eye-movement signals are transformed into domain-specific token representations, denoted as $S$, $L$, and $I$, respectively. Within each DTFE, representation learning is explicitly decomposed into cross-series channel interaction modeling and intra-series temporal–spectral refinement, producing compact yet discriminative feature embeddings under a structured factorization principle.

The resulting modality-level representations are subsequently fed into a hierarchical cross-modal fusion module, which integrates modality tokens in a representation-consistent and interaction-aware manner. This design facilitates coordinated reasoning across heterogeneous physiological signals with distinct temporal characteristics while preserving structural coherence across modalities.

Finally, the fused representations are jointly optimized to predict emotional valence and music preference within a unified multi-task learning framework. The entire model is trained end-to-end using a multi-objective loss that balances predictive accuracy and representational efficiency. Through this structured processing pipeline, the proposed framework provides an interpretable and modular mechanism for modeling how physiological signals encode emotional states and personal preferences in music therapy scenarios.

Primary and Aligned Evaluation Settings. In this study, the primary modeling objective corresponds to a fine-grained 9-class multi-task configuration that jointly characterizes valence, arousal, and liking dimensions in a structured manner. Unless otherwise specified, all methodological formulations, architectural descriptions, and optimization strategies refer to this 9-class primary setting. For fair comparison with existing literature, we additionally report results under aligned 2-class and 5-class evaluation protocols commonly adopted in prior EEG-based emotion recognition studies. These reduced-class configurations are used solely for benchmarking purposes and do not modify the proposed structured factorization architecture or training mechanism.

3.1. Dynamic Token Feature Extractor (DTFE)

The proposed Dynamic Token Feature Extractor (DTFE) serves as the core feature encoder of the multi-modal framework, transforming heterogeneous physiological time series into compact and task-discriminative token representations. As illustrated in Figure 2, DTFE follows a normalize–project–tokenize–intersect–decode pipeline, where learnable token matrices $\{Q_i, Q_w, Q_f, Q_o\}$ function as adaptive operators for modeling temporal dynamics and channel interactions. Through explicit architectural factorization, DTFE introduces structured inductive biases that systematically capture cross-channel dependencies and intra-channel temporal–spectral patterns. Here, inductive biases denote the architectural assumptions embedded in the explicit factorization of inter-channel and intra-channel modeling, which guide the network toward structured and task-relevant representations. Compared with conventional sequence encoders (e.g., CNN or RNN), whose computational cost scales with sequence length, DTFE leverages token-based mixing to achieve both expressive representation learning and computational efficiency, making it well suited for multi-signal affective modeling.
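To make the normalize–project–tokenize–intersect–decode pipeline concrete, the following is a minimal PyTorch sketch of the overall DTFE flow. The hidden size $d = 64$, embedding size $\tau = 32$, and the identity placeholder standing in for the CSI/ISI blocks are illustrative assumptions rather than the reported configuration; each stage is sketched in more detail in the subsections below.

```python
# Minimal DTFE pipeline sketch (normalize -> project -> tokenize -> intersect -> decode).
# Dimensions and the Identity placeholder for CSI/ISI are illustrative assumptions.
import torch
import torch.nn as nn

class DTFESketch(nn.Module):
    def __init__(self, n_channels: int, seq_len: int, d: int = 64, tau: int = 32):
        super().__init__()
        self.norm = nn.InstanceNorm1d(n_channels)                    # per-channel instance normalization
        self.proj = nn.Sequential(                                   # channel-wise projection to latent dim d
            nn.Linear(n_channels, d), nn.ReLU(), nn.Linear(d, d))
        self.Q_i = nn.Parameter(torch.randn(seq_len, seq_len) / seq_len ** 0.5)  # learnable temporal mixing
        self.intersect = nn.Identity()                                # placeholder for the CSI + ISI blocks
        self.w_pool = nn.Parameter(torch.randn(d))                    # attention-pooling vector
        self.out = nn.Linear(d, tau)                                  # output embedding head

    def forward(self, x):                                    # x: (B, C, T)
        x = self.norm(x)                                     # normalize
        x = self.proj(x.transpose(1, 2)).transpose(1, 2)     # project: (B, d, T)
        x = x @ self.Q_i                                     # tokenize: adaptive temporal mixing
        x = self.intersect(x)                                # intersect: CSI / ISI refinement
        attn = torch.softmax((self.w_pool @ x) / x.size(1) ** 0.5, dim=-1)  # (B, T)
        z = torch.einsum("bt,bdt->bd", attn, x)              # decode: attention pooling
        return self.out(z)                                   # (B, tau)

# Example: an EEG-like segment with 32 channels and 1280 samples (10 s at 128 Hz).
emb = DTFESketch(n_channels=32, seq_len=1280)(torch.randn(4, 32, 1280))
print(emb.shape)  # torch.Size([4, 32])
```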

3.1.1. Module Architecture

Input Formulation
For each modality, the input physiological signal segment is represented as
$X \in \mathbb{R}^{B \times C \times T},$
where $B$ denotes the batch size, $C$ the number of channels (e.g., $C = 32$ for EEG, $C = 6$ for peripheral signals, and $C = 2$ for EOG), and $T$ the temporal length of the analysis window (e.g., $T = 1280$ for 10 s windows at 128 Hz). This formulation explicitly separates inter-channel structure from intra-channel temporal dynamics, providing a clear foundation for subsequent interaction and temporal modeling.
Instance Normalization
To reduce inter-subject and inter-session amplitude variations while preserving discriminative temporal characteristics, DTFE applies instance normalization independently to each channel:
$\hat{X}_{b,c,t} = \dfrac{X_{b,c,t} - \mu_{b,c}}{\sigma_{b,c} + \epsilon}, \qquad \mu_{b,c} = \dfrac{1}{T}\sum_{t=1}^{T} X_{b,c,t}, \qquad \sigma_{b,c} = \sqrt{\dfrac{1}{T}\sum_{t=1}^{T} \left( X_{b,c,t} - \mu_{b,c} \right)^{2}},$
where ϵ ensures numerical stability. This normalization enhances robustness to subject-dependent baselines and stabilizes optimization across heterogeneous modalities.
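As a concrete reference, the normalization above can be transcribed directly in PyTorch; the value of $\epsilon$ used here is an assumption.

```python
# Direct transcription of the per-channel instance normalization above.
import torch

def instance_normalize(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """x: (B, C, T) physiological segment; eps is the numerical-stability constant."""
    mu = x.mean(dim=-1, keepdim=True)                             # mu_{b,c}
    sigma = x.var(dim=-1, unbiased=False, keepdim=True).sqrt()    # sigma_{b,c}
    return (x - mu) / (sigma + eps)

x = torch.randn(4, 32, 1280) * 5.0 + 3.0          # subject-dependent scale and offset
x_hat = instance_normalize(x)
print(x_hat.mean(dim=-1).abs().max(), x_hat.std(dim=-1).mean())  # ~0 mean, ~1 std per channel
```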

3.1.2. Multi-Dimensional Projection

Channel-Wise Projection to Latent Tokens
After normalization, each channel sequence is projected into a d-dimensional latent space via a lightweight MLP implemented as point-wise linear layers along the channel axis:
$X_{\mathrm{proj}}(b,:,t) = W_{2}^{\mathrm{proj}}\, \phi\!\left( W_{1}^{\mathrm{proj}} \hat{X}(b,:,t) + b_{1}^{\mathrm{proj}} \right) + b_{2}^{\mathrm{proj}},$
where $W_{1}^{\mathrm{proj}} \in \mathbb{R}^{d \times C}$, $W_{2}^{\mathrm{proj}} \in \mathbb{R}^{d \times d}$, $b_{1}^{\mathrm{proj}} \in \mathbb{R}^{d}$, and $b_{2}^{\mathrm{proj}} \in \mathbb{R}^{d}$ are learnable parameters, and $\phi(\cdot)$ denotes a nonlinear activation function (ReLU). This projection step serves two purposes: (i) aligning heterogeneous modalities within a shared latent representation space, and (ii) enhancing representational capacity to capture subtle physiological cues associated with affective states. By performing channel mixing prior to temporal modeling, the framework establishes a structured feature space that facilitates subsequent interaction and temporal refinement.

3.1.3. Token-Based Processing

Learnable Temporal Tokenization
To replace fixed temporal pooling with adaptive aggregation, DTFE introduces learnable token mixing. Given the projected sequence $X_{\mathrm{proj}} \in \mathbb{R}^{B \times d \times T}$, a learnable tokenization matrix
$Q_{i} \in \mathbb{R}^{T \times T}$
is applied to obtain
$F_{i} = X_{\mathrm{proj}} Q_{i} \in \mathbb{R}^{B \times d \times T}.$
Here, Q i functions as a learnable temporal mixing operator. Each column of Q i represents an adaptive temporal weighting pattern that selectively aggregates information across time steps. This mechanism enables the model to emphasize informative signal segments—such as transient autonomic responses, rhythmic neural oscillations, or event-related eye-movement patterns—while preserving flexibility in modeling non-stationary dynamics. Unlike fixed pooling or handcrafted filtering strategies, Q i is optimized end-to-end, allowing temporal aggregation to be data-adaptive and task-driven.
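A minimal sketch of the learnable temporal mixing operator $Q_{i}$ is given below. Initializing $Q_{i}$ to a uniform averaging matrix, so that the layer initially behaves like mean pooling before its columns become data-adaptive weighting patterns, is our assumption for illustration; the text only specifies that $Q_{i}$ is optimized end-to-end.

```python
# Learnable temporal token mixing F_i = X_proj Q_i, initialized to uniform averaging
# (an assumption for illustration; training makes the columns data-adaptive).
import torch
import torch.nn as nn

class TemporalTokenMixer(nn.Module):
    def __init__(self, seq_len: int):
        super().__init__()
        self.Q_i = nn.Parameter(torch.full((seq_len, seq_len), 1.0 / seq_len))

    def forward(self, x_proj: torch.Tensor) -> torch.Tensor:  # x_proj: (B, d, T)
        return x_proj @ self.Q_i                               # F_i: (B, d, T)

x_proj = torch.randn(4, 64, 1280)            # projected segment: B=4, d=64, T=1280
f_i = TemporalTokenMixer(seq_len=1280)(x_proj)
print(f_i.shape)                              # torch.Size([4, 64, 1280])
```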
Design Rationale
Physiological responses to music exhibit non-stationary temporal characteristics and modality-dependent dynamics. The token-based formulation in Equation (5) provides a unified and computationally efficient mechanism for adaptive temporal modeling. Since the dominant operations involve matrix multiplications on compact token representations, DTFE avoids heavy recurrent computation and enables scalable training and inference across multiple modalities.

3.1.4. Representation Learning

Intuitive Motivation
Physiological affective signals contain two fundamentally distinct yet intertwined sources of variability. First, coordinated responses across channels reflect system-level coupling patterns (e.g., neural–autonomic synchronization or bilateral ocular coherence). Second, each individual channel exhibits its own temporal and spectral dynamics, such as transient responses, oscillatory rhythms, or frequency-specific modulations [30]. Conventional monolithic encoders (e.g., CNNs or RNNs) typically entangle these cross-channel interactions and intra-channel dynamics within a single shared representation space. Such implicit mixing may lead to representation interference, where channel interaction patterns and temporal–spectral structures compete within the same parameter subspace, potentially reducing robustness under cross-subject or cross-session variations. By explicitly factorizing representation learning into Cross-Series Intersection (CSI) and Intra-Series Intersection (ISI), the proposed framework introduces a structured inductive bias that decouples inter-channel coordination from intra-channel refinement. This decomposition constrains the hypothesis space into physiologically meaningful subspaces, thereby reducing unnecessary coupling and improving generalization stability.

The representation learning stage forms the core of the DTFE module, where tokenized features are progressively refined to capture both cross-channel dependencies and intra-channel temporal–spectral dynamics. Starting from the temporally tokenized representation $F_{i} \in \mathbb{R}^{B \times d \times T}$, DTFE employs two complementary components in a cascaded manner:
$F_{\mathrm{CSI}} = \mathrm{CrossSeriesIntersection}(F_{i}), \qquad F_{\mathrm{ISI}} = \mathrm{IntraSeriesIntersection}(F_{\mathrm{CSI}}),$
where $B$ denotes the batch size, $T$ the temporal length, and $d$ the latent feature dimension. The CSI module models structured inter-channel interactions, while ISI refines each channel representation through temporal and frequency-domain enhancement. This architecture follows an explicit factorization principle, separating inter-channel interaction modeling from intra-channel dynamic refinement. Such modular decomposition improves structural interpretability and stabilizes representation learning across heterogeneous modalities.
Cross-Series Intersection (CSI)
The Cross-Series Intersection (CSI) module is designed to capture structured dependencies among physiological channels. For EEG signals, CSI models spatial correlations across brain regions; for peripheral signals, it reflects coordinated autonomic responses; for EOG signals, it captures bilateral eye-movement coherence. These cross-channel relationships are essential for affective inference, yet they are often implicitly entangled with temporal modeling in conventional sequence encoders.
To explicitly model channel interactions, CSI first summarizes temporally tokenized features into compact channel descriptors via temporal pooling:
$G = \dfrac{1}{T}\sum_{t=1}^{T} F_{i}(:,:,t) \in \mathbb{R}^{B \times d},$
where B denotes the batch size and d the latent feature dimension.
For each sample $b$, the pooled descriptor $G(b) \in \mathbb{R}^{d}$ is projected into a channel interaction space via a low-rank bilinear formulation:
$U(b) = W_{u}^{\mathrm{csi}} G(b), \qquad V(b) = W_{v}^{\mathrm{csi}} G(b),$
where $W_{u}^{\mathrm{csi}}, W_{v}^{\mathrm{csi}} \in \mathbb{R}^{C \times d}$ are learnable projection matrices specific to the CSI module, and $C$ denotes the number of channels. This formulation maps global channel descriptors into a structured interaction space while maintaining parameter efficiency.
The cross-series affinity matrix for each sample $b$ is then computed as
$Q_{w}(b) = U(b)\, V(b)^{\top} \in \mathbb{R}^{C \times C},$
which explicitly encodes pairwise channel correlations. The low-rank bilinear design reduces computational overhead while preserving sufficient flexibility for modeling inter-channel dependencies.
After row-wise softmax normalization,
$A_{w}(b,i,j) = \dfrac{\exp\!\left( Q_{w}(b,i,j) \right)}{\sum_{j'=1}^{C} \exp\!\left( Q_{w}(b,i,j') \right)},$
each channel adaptively aggregates information from other channels. The refined representation is obtained through residual interaction:
$F_{\mathrm{CSI}}(b) = A_{w}(b)\, F_{i}(b) + F_{i}(b),$
where $F_{i}(b) \in \mathbb{R}^{d \times T}$ denotes the tokenized representation of sample $b$. The residual connection preserves original tokenized features and stabilizes optimization. Through this formulation, CSI provides an explicit and data-adaptive mechanism for cross-channel interaction modeling, enabling coordinated feature refinement across heterogeneous physiological signals.
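The following is a hedged PyTorch sketch of the CSI block. The text defines the affinity over the $C$ physiological channels via $W_{u}^{\mathrm{csi}}, W_{v}^{\mathrm{csi}} \in \mathbb{R}^{C \times d}$; to keep all tensor shapes consistent in a standalone snippet, this sketch builds the rank-1 affinity over the series axis of $F_{i}$ (its first non-batch dimension), which is our reading and an assumption rather than the authors' exact wiring.

```python
# Sketch of Cross-Series Intersection: temporal pooling -> low-rank bilinear affinity ->
# row-wise softmax -> residual interaction. The series axis used for the affinity is an assumption.
import torch
import torch.nn as nn

class CrossSeriesIntersection(nn.Module):
    def __init__(self, n_series: int):
        super().__init__()
        self.W_u = nn.Linear(n_series, n_series, bias=False)   # W_u^csi
        self.W_v = nn.Linear(n_series, n_series, bias=False)   # W_v^csi

    def forward(self, f_i: torch.Tensor) -> torch.Tensor:      # f_i: (B, S, T)
        g = f_i.mean(dim=-1)                                   # temporal pooling G: (B, S)
        u, v = self.W_u(g), self.W_v(g)                        # projected descriptors
        q_w = u.unsqueeze(2) * v.unsqueeze(1)                  # rank-1 affinity Q_w: (B, S, S)
        a_w = torch.softmax(q_w, dim=-1)                       # row-wise softmax A_w
        return torch.bmm(a_w, f_i) + f_i                       # residual interaction: (B, S, T)

f_i = torch.randn(4, 64, 1280)                                 # B=4, 64 latent series, T=1280
f_csi = CrossSeriesIntersection(n_series=64)(f_i)
print(f_csi.shape)                                             # torch.Size([4, 64, 1280])
```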
Intra-Series Intersection (ISI)
The Intra-Series Intersection (ISI) module refines temporal dynamics within each individual channel by jointly modeling time-domain and frequency-domain characteristics. While CSI captures cross-channel interaction structure, ISI enhances channel-wise representations by exploiting spectral patterns that are known to be informative for affective physiological analysis.
Time-Domain Projection and Nonlinearity
Given $F_{\mathrm{CSI}} \in \mathbb{R}^{B \times d \times T}$, ISI first applies a channel-wise linear transformation followed by a smooth non-linearity:
$F_{\mathrm{lin}} = W_{l}^{\mathrm{isi}} F_{\mathrm{CSI}} + b_{l}^{\mathrm{isi}}, \qquad F_{\mathrm{act}} = \mathrm{GELU}(F_{\mathrm{lin}}),$
where $W_{l}^{\mathrm{isi}} \in \mathbb{R}^{d \times d}$ and $b_{l}^{\mathrm{isi}} \in \mathbb{R}^{d}$ are learnable parameters. This step increases feature expressiveness and prepares the representation for spectral refinement.
Frequency-Domain Transformation
To capture oscillatory and rhythmic structures, ISI transforms the activated features into the frequency domain via a one-dimensional FFT along the temporal axis:
$\mathcal{F}(F_{\mathrm{act}})(b,:,k) = \sum_{t=0}^{T-1} F_{\mathrm{act}}(b,:,t)\, e^{-i 2\pi k t / T}.$
It is worth noting that the FFT output is complex-valued, i.e., $\mathcal{F}(F_{\mathrm{act}}) \in \mathbb{C}^{B \times d \times T}$. Each frequency component consists of a real and an imaginary part, which jointly encode amplitude and phase information. This transformation decomposes temporal signals into frequency components, allowing explicit modeling of multi-scale spectral characteristics that are relevant to neural oscillations and autonomic rhythms.
Learnable Frequency Gating
Instead of relying on predefined frequency bands, ISI introduces a learnable frequency modulation vector:
$Q_{f} \in \mathbb{R}^{T},$
which operates on the magnitude spectrum while preserving phase information. Specifically, let $\mathcal{F}(F_{\mathrm{act}}) = A\, e^{i \Phi}$, where $A$ and $\Phi$ denote the amplitude and phase components, respectively. The learnable vector $Q_{f}$ scales only the amplitude term:
$F_{\mathrm{freq}} = (A \odot Q_{f})\, e^{i \Phi}.$
This design ensures that phase consistency is maintained while enabling adaptive reweighting of spectral energy. Here, Q f acts as a data-driven spectral weighting mechanism, allowing the model to emphasize informative frequency components while maintaining flexibility across subjects and modalities.
Inverse Transformation and Residual Fusion
The refined representation is mapped back to the time domain:
$\mathcal{F}^{-1}(F_{\mathrm{freq}})(b,:,t) = \dfrac{1}{T}\sum_{k=0}^{T-1} F_{\mathrm{freq}}(b,:,k)\, e^{i 2\pi k t / T}.$
Since the input temporal signal is real-valued, its FFT representation satisfies conjugate symmetry. After inverse transformation, we take the real part of the reconstructed signal, $\mathrm{Re}\!\left( \mathcal{F}^{-1}(F_{\mathrm{freq}}) \right)$, which guarantees that the output remains in the real domain. The final output is obtained via residual fusion:
$F_{\mathrm{ISI}} = \mathrm{Re}\!\left( \mathcal{F}^{-1}(F_{\mathrm{freq}}) \right) + \mathrm{LayerNorm}(F_{\mathrm{act}}).$
The residual connection preserves temporal structure while incorporating frequency-aware refinement, thereby stabilizing optimization and preventing over-suppression of informative time-domain features.
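A minimal PyTorch sketch of the ISI step follows: channel-wise linear projection with GELU, an FFT along time, learnable amplitude gating with phase preserved, inverse transformation, and residual fusion with the LayerNorm-ed time-domain features. Using `rfft`/`irfft` (rather than the full complex FFT over $T$ bins) is an implementation convenience of this sketch that automatically enforces the conjugate symmetry mentioned above, so $Q_{f}$ here has $T/2+1$ entries instead of $T$.

```python
# Sketch of Intra-Series Intersection with learnable frequency gating (rFFT variant).
import torch
import torch.nn as nn

class IntraSeriesIntersection(nn.Module):
    def __init__(self, d: int, seq_len: int):
        super().__init__()
        self.lin = nn.Linear(d, d)                              # W_l^isi, b_l^isi (per time step)
        self.act = nn.GELU()
        self.Q_f = nn.Parameter(torch.ones(seq_len // 2 + 1))   # learnable gate over rFFT bins
        self.norm = nn.LayerNorm(d)

    def forward(self, f_csi: torch.Tensor) -> torch.Tensor:     # f_csi: (B, d, T)
        f_act = self.act(self.lin(f_csi.transpose(1, 2))).transpose(1, 2)   # time-domain branch
        spec = torch.fft.rfft(f_act, dim=-1)                    # complex spectrum: (B, d, T//2+1)
        amp, phase = spec.abs(), spec.angle()                   # A and Phi
        gated = torch.polar(amp * self.Q_f, phase)              # scale amplitude only, keep phase
        f_freq = torch.fft.irfft(gated, n=f_act.size(-1), dim=-1)   # back to the real time domain
        return f_freq + self.norm(f_act.transpose(1, 2)).transpose(1, 2)    # residual fusion

f_isi = IntraSeriesIntersection(d=64, seq_len=1280)(torch.randn(4, 64, 1280))
print(f_isi.shape)                                              # torch.Size([4, 64, 1280])
```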
Design Rationale
Physiological responses often exhibit both transient temporal events and sustained oscillatory behaviors. By combining adaptive spectral modulation with residual time-domain refinement, ISI provides a flexible mechanism for capturing complementary temporal and frequency-domain characteristics. Together with CSI, this design realizes an explicit factorization of representation learning into cross-channel interaction modeling and intra-channel dynamic refinement, thereby improving modularity, interpretability, and generalization robustness.

3.1.5. Output Generation

After intra-series refinement, DTFE produces a compact, modality-specific representation that summarizes affect-relevant information from the input signal. Given $F_{\mathrm{ISI}} \in \mathbb{R}^{B \times d \times T}$, a linear projection is first applied to unify the feature dimension:
$F_{\mathrm{proj}} = W_{d}^{\mathrm{out}} F_{\mathrm{ISI}} + b_{d}^{\mathrm{out}} \in \mathbb{R}^{B \times d \times T},$
where $W_{d}^{\mathrm{out}} \in \mathbb{R}^{d \times d}$ and $b_{d}^{\mathrm{out}} \in \mathbb{R}^{d}$ are learnable parameters. This projection stabilizes feature scaling and adapts the representation for downstream aggregation.
Temporal Aggregation via Learnable Attention Pooling
To adaptively summarize temporal dynamics, a learnable attention-based aggregation mechanism is adopted. The temporal importance weights are computed as:
$\alpha = \mathrm{Softmax}\!\left( \dfrac{w_{\mathrm{pool}}^{\top} F_{\mathrm{proj}}}{\sqrt{d}} \right) \in \mathbb{R}^{B \times T},$
where $w_{\mathrm{pool}} \in \mathbb{R}^{d}$ is a learnable vector. The aggregated representation is:
$z = \sum_{t=1}^{T} \alpha_{t}\, F_{\mathrm{proj}}(:,:,t) \in \mathbb{R}^{B \times d}.$
This mechanism assigns adaptive weights to temporally informative segments, enabling flexible modeling of both transient responses and sustained affective patterns.
Output Embedding Projection
The aggregated feature is further projected into the output embedding space:
$Y = W_{p}^{\mathrm{emb}} z + b_{p}^{\mathrm{emb}} \in \mathbb{R}^{B \times \tau},$
where $\tau$ denotes the embedding dimension, and $W_{p}^{\mathrm{emb}} \in \mathbb{R}^{\tau \times d}$, $b_{p}^{\mathrm{emb}} \in \mathbb{R}^{\tau}$ are learnable parameters. The resulting vector $Y$ serves as the final modality-specific representation generated by DTFE and is subsequently fed into the cross-modal interaction and hierarchical fusion module.
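A compact sketch of the output stage (feature projection, learnable attention pooling over time, and embedding projection) is shown below. The scaled dot-product form $w_{\mathrm{pool}}^{\top} F_{\mathrm{proj}} / \sqrt{d}$ follows our reading of the pooling equation above; $d = 64$ and $\tau = 32$ are illustrative.

```python
# Sketch of the DTFE output head: projection, attention pooling over time, embedding.
import torch
import torch.nn as nn

class DTFEOutputHead(nn.Module):
    def __init__(self, d: int, tau: int):
        super().__init__()
        self.proj = nn.Linear(d, d)                  # W_d^out, b_d^out
        self.w_pool = nn.Parameter(torch.randn(d))   # learnable pooling vector
        self.emb = nn.Linear(d, tau)                 # W_p^emb, b_p^emb

    def forward(self, f_isi: torch.Tensor) -> torch.Tensor:       # f_isi: (B, d, T)
        f_proj = self.proj(f_isi.transpose(1, 2))                  # (B, T, d)
        scores = f_proj @ self.w_pool / f_proj.size(-1) ** 0.5     # (B, T)
        alpha = torch.softmax(scores, dim=-1)                      # temporal importance weights
        z = torch.einsum("bt,btd->bd", alpha, f_proj)              # weighted temporal aggregation
        return self.emb(z)                                         # Y: (B, tau)

y = DTFEOutputHead(d=64, tau=32)(torch.randn(4, 64, 1280))
print(y.shape)                                                     # torch.Size([4, 32])
```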

3.2. Cross-Modal Interaction and Fusion

  • Overview. As illustrated in Figure 3, the cross-modal interaction and fusion module integrates modality-specific representations extracted by DTFE into a unified affective embedding. Instead of performing early fusion through direct feature concatenation, the proposed design conducts interaction modeling at the representation level, allowing adaptive integration of complementary physiological cues from neural, autonomic, and attentional systems.
  • Modality-level feature construction.
Let $S \in \mathbb{R}^{B \times d}$, $L \in \mathbb{R}^{B \times d}$, and $I \in \mathbb{R}^{B \times d}$ denote the modality-specific embeddings produced by the EEG, peripheral physiological, and EOG DTFEs, respectively. These embeddings are stacked to form a modality token sequence:
$F = [\,S;\, L;\, I\,] \in \mathbb{R}^{B \times M \times d},$
where $M = 3$ denotes the number of modalities. Each token summarizes affect-relevant information from a distinct physiological subsystem, serving as a compact semantic unit for cross-modal interaction.
  • Cross-modal attention-based interaction.
To model inter-modality relationships, multi-head self-attention is applied over the modality tokens:
$F_{\mathrm{att}} = \mathrm{MultiHeadAttention}(F, F, F).$
Specifically, given $F \in \mathbb{R}^{B \times M \times d}$, the input is first linearly projected into queries, keys, and values:
$Q = F W_{Q}^{\mathrm{fusion}}, \qquad K = F W_{K}^{\mathrm{fusion}}, \qquad V = F W_{V}^{\mathrm{fusion}},$
where $W_{Q}^{\mathrm{fusion}}, W_{K}^{\mathrm{fusion}}, W_{V}^{\mathrm{fusion}} \in \mathbb{R}^{d \times d}$ are learnable projection matrices. The projected features are then split into $H$ attention heads, each with dimensionality $d_{h} = d / H$. For the $h$-th head, scaled dot-product attention is computed as:
$\mathrm{Attention}_{h}(Q, K, V) = \mathrm{Softmax}\!\left( \dfrac{Q_{h} K_{h}^{\top}}{\sqrt{d_{h}}} \right) V_{h}.$
The outputs of all heads are concatenated and projected:
$F_{\mathrm{att}} = \mathrm{Concat}(\mathrm{Attention}_{1}, \ldots, \mathrm{Attention}_{H})\, W_{O}^{\mathrm{fusion}},$
where $W_{O}^{\mathrm{fusion}} \in \mathbb{R}^{d \times d}$ is a learnable output matrix. Queries, keys, and values are derived from the same token set, enabling each modality representation to adaptively attend to others. Since the number of modality tokens is small ($M = 3$), the computational complexity of attention remains negligible while still allowing expressive cross-modal interaction modeling.
  • Feature refinement and residual learning.
The attended representations are further refined using a position-wise feed-forward network (FFN) with residual connection:
$F_{\mathrm{ref}} = \mathrm{LayerNorm}\!\left( \mathrm{FFN}(F_{\mathrm{att}}) + F_{\mathrm{att}} \right).$
This refinement enhances nonlinear cross-modal interactions while preserving stable gradient propagation through residual learning.
  • Global fusion and unified representation.
A global pooling operation aggregates the refined modality tokens into a single fused embedding:
$z_{\mathrm{fusion}} = \dfrac{1}{M}\sum_{m=1}^{M} F_{\mathrm{ref}}(m) \in \mathbb{R}^{B \times d}.$
The resulting representation $z_{\mathrm{fusion}}$ serves as a unified affective descriptor integrating neural, autonomic, and attentional information. It is subsequently fed into task-specific prediction heads for emotion and preference recognition. A minimal implementation sketch of this interaction-and-fusion pipeline is provided after this list.
  • Discussion. The proposed fusion strategy provides three practical advantages: (1) it performs cross-modal interaction at a high-level representation space, reducing noise propagation from raw signal domains; (2) it operates on compact modality embeddings, avoiding redundant temporal modeling at the fusion stage; and (3) it enables adaptive, data-driven weighting of physiological cues according to affective context. By conducting interaction modeling on modality-level tokens, the fusion module offers a scalable and interpretable mechanism for integrating heterogeneous physiological signals, which is particularly suitable for music therapy applications where multiple physiological systems jointly contribute to emotional responses.
  • Computational efficiency. In addition to accuracy, computational efficiency is a practical consideration in multi-modal affective modeling. Our design emphasizes compact token-based encoding and lightweight cross-modal interaction so that performance gains do not rely on excessive model scaling. As reported in Table 2, our method achieves the best overall multi-task performance while maintaining moderate computational cost (2.86 M parameters and 0.82 G FLOPs), which is lower than Transformer-based baselines and notably more efficient than recent EEG-specific architectures. These results indicate that the proposed structured DTFE and hierarchical fusion introduce only limited overhead, yet they deliver substantial accuracy improvements.
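As referenced above, the following is a minimal PyTorch sketch of the modality-token fusion stage: stack the three DTFE embeddings, apply multi-head self-attention over the $M = 3$ tokens, refine with a residual FFN plus LayerNorm, and mean-pool into a single fused descriptor. The head count, FFN width, and the assumption that the DTFE embedding size equals the fusion width $d$ are illustrative choices.

```python
# Sketch of hierarchical cross-modal fusion over three modality tokens.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, s: torch.Tensor, l: torch.Tensor, i: torch.Tensor) -> torch.Tensor:
        f = torch.stack([s, l, i], dim=1)             # modality tokens F: (B, M=3, d)
        f_att, _ = self.attn(f, f, f)                 # cross-modal self-attention
        f_ref = self.norm(self.ffn(f_att) + f_att)    # residual FFN refinement
        return f_ref.mean(dim=1)                      # z_fusion: (B, d)

s, l, i = (torch.randn(4, 64) for _ in range(3))      # EEG, peripheral, EOG embeddings
z_fusion = CrossModalFusion(d=64)(s, l, i)
print(z_fusion.shape)                                  # torch.Size([4, 64])
```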

3.2.1. Multi-Task Loss Function

We formulate emotional valence, arousal, and music preference recognition as parallel multi-class classification tasks with nine ordinal categories each. These affective dimensions are inherently correlated yet exhibit different levels of label imbalance, subjectivity, and learning difficulty. To jointly optimize these tasks while maintaining robustness to noisy annotations, we design a unified multi-task objective that combines focal cross-entropy with label smoothing and adaptive task weighting.
Overall Objective
The total training loss is defined as a weighted sum of task-specific losses:
$\mathcal{L}_{\mathrm{total}} = \lambda_{e} \mathcal{L}_{e} + \lambda_{a} \mathcal{L}_{a} + \lambda_{p} \mathcal{L}_{p},$
where $\mathcal{L}_{e}$, $\mathcal{L}_{a}$, and $\mathcal{L}_{p}$ correspond to the losses for valence, arousal, and preference, respectively. The weights $\lambda_{e}$, $\lambda_{a}$, and $\lambda_{p}$ are learnable scalar parameters, initialized to $\tfrac{1}{3}$ and jointly optimized with the network. This design enables the model to automatically balance task contributions during training, alleviating dominance by easier or over-confident tasks.
Task-Specific Loss Formulation
For each affective dimension $t \in \{e, a, p\}$, we adopt a focal cross-entropy loss with label smoothing:
$\mathcal{L}_{t} = -\sum_{i=1}^{9} \left( 1 - \hat{y}_{t,i} \right)^{\gamma} \left[ (1 - \alpha)\, y_{t,i} + \dfrac{\alpha}{9} \right] \log\!\left( \hat{y}_{t,i} \right),$
where $y_{t,i}$ denotes the one-hot ground-truth label for class $i$, and $\hat{y}_{t,i}$ is the predicted softmax probability. The focal factor $(1 - \hat{y}_{t,i})^{\gamma}$ down-weights well-classified samples and emphasizes harder or ambiguous instances, which are common in affective annotations. The label smoothing term, controlled by $\alpha$, mitigates over-confidence and accounts for subjective uncertainty in emotion perception, particularly at category boundaries.
Hyperparameter Selection and Sensitivity Analysis
The focal factor $\gamma$ controls the emphasis on hard samples. According to the original focal loss formulation [39], $\gamma$ is typically set between 1 and 3. In our experiments, we evaluated $\gamma$ within this commonly adopted range and observed stable performance when $\gamma \in [2.0, 3.0]$. The label smoothing coefficient $\alpha$ regulates the degree of confidence relaxation. Prior studies commonly adopt $\alpha$ between 0.05 and 0.1. We conducted additional sensitivity analysis by testing $\alpha \in \{0.0, 0.05, 0.1, 0.15\}$, and we found that $\alpha = 0.1$ provides the best balance between regularization and classification sharpness. Across all affective dimensions, performance variation remained within 0.7% under these ranges, indicating that the proposed framework is not overly sensitive to these hyperparameters. In all reported experiments, we fix $\gamma = 2.5$ and $\alpha = 0.1$ without task-specific tuning.
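A sketch of the multi-task objective is given below: focal cross-entropy with label smoothing for each 9-class head, combined through learnable task weights $\lambda_{e}$, $\lambda_{a}$, $\lambda_{p}$ initialized to $\tfrac{1}{3}$. The values $\gamma = 2.5$ and $\alpha = 0.1$ follow the setting above; leaving the weights unconstrained is a simplification of this sketch.

```python
# Sketch of the unified multi-task loss: focal CE with label smoothing + learnable task weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

def focal_smoothed_ce(logits, targets, gamma=2.5, alpha=0.1, n_classes=9):
    """Focal cross-entropy with label smoothing for one affective dimension."""
    probs = F.softmax(logits, dim=-1)                                       # predicted y_hat
    smooth = F.one_hot(targets, n_classes).float() * (1 - alpha) + alpha / n_classes
    focal = (1 - probs).pow(gamma)                                          # down-weight easy classes
    return -(focal * smooth * torch.log(probs.clamp_min(1e-8))).sum(dim=-1).mean()

class MultiTaskLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # lambda_e, lambda_a, lambda_p; kept unconstrained here for brevity
        self.lambdas = nn.Parameter(torch.full((3,), 1.0 / 3.0))

    def forward(self, logits_e, logits_a, logits_p, y_e, y_a, y_p):
        losses = torch.stack([
            focal_smoothed_ce(logits_e, y_e),
            focal_smoothed_ce(logits_a, y_a),
            focal_smoothed_ce(logits_p, y_p),
        ])
        return (self.lambdas * losses).sum()

loss_fn = MultiTaskLoss()
logits = [torch.randn(8, 9) for _ in range(3)]
labels = [torch.randint(0, 9, (8,)) for _ in range(3)]
print(loss_fn(*logits, *labels))
```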

4. Experiments

4.1. Dataset and Task Setup

The evaluation of the proposed framework is conducted on the well-known DEAP dataset [40], a multimodal benchmark extensively used in affective computing research. The dataset comprises recordings from 32 participants, each viewing 40 one-minute-long music video clips while multiple physiological modalities were simultaneously recorded. Specifically, DEAP provides 32-channel electroencephalography (EEG), six peripheral physiological signals (galvanic skin response (GSR), blood volume pulse (BVP), zygomaticus and trapezius electromyograms (EMG-Zyg, EMG-Trap), respiration, and temperature), along with two channels of electrooculography (EOG). All signals were originally sampled at 512 Hz and subsequently downsampled to 128 Hz for computational efficiency. The EEG data were bandpass-filtered between 4 and 45 Hz to retain relevant neural oscillatory components.
  • Strict subject-level partition before window segmentation.
To rigorously prevent any potential information leakage, subject-level splitting is performed prior to any temporal window segmentation. Concretely, the 32 participants are first partitioned into mutually exclusive training and testing subject sets; unless otherwise specified, an 8:2 subject-wise ratio is used. Only after this partition is finalized is sliding-window segmentation applied independently within each split, ensuring that no raw recordings from a given subject contribute windows to both the training and testing sets. For temporal modeling, unless otherwise specified, the recordings are segmented into overlapping 10 s windows (1280 time steps) with a stride of 1 s. Each window yields three modality-specific tensors: EEG ( [ 32, 1280 ] ), peripheral physiological signals ( [ 6, 1280 ] ), and EOG ( [ 2, 1280 ] ). Although overlapping windows generate highly correlated samples within the same subject, such overlap exists strictly within each subject-specific partition: no window derived from the same subject appears across the training and testing sets, so the adopted segmentation strategy does not introduce cross-split information leakage. This design prevents subject identity leakage and enables a fair evaluation of cross-subject generalization, and we do not perform subject-dependent training in any experiment reported in this paper. A minimal sketch of this partition-then-segment procedure is given at the end of this subsection.

All signal streams are z-score normalized on a per-subject basis to eliminate inter-individual variability. Importantly, normalization statistics are computed independently for each subject using only that subject's data, and no global normalization across subjects is performed, which further eliminates any unintended statistical leakage between training and testing partitions. Within the training set, a subset of subjects is further held out for validation to support hyperparameter tuning and early stopping.

To generate compact and discriminative representations, a modality-specific tokenization layer is applied within the proposed DTFE module. The EEG stream is projected into four learnable tokens, the peripheral stream into two tokens, and the EOG stream into one token. This asymmetric token allocation follows a capacity–complexity matching principle. EEG contains 32 channels and exhibits rich spatial–temporal dynamics with diverse oscillatory patterns, thus requiring higher representational capacity to preserve informative variability after tokenization. In contrast, peripheral physiological signals include only six channels and primarily capture smoother autonomic trends, for which fewer tokens are sufficient. EOG consists of only two channels and mainly reflects coarse attentional activity, where a single token provides a compact yet stable summary without introducing redundancy. Therefore, the 4/2/1 configuration is a modality-aware capacity assignment strategy that balances expressiveness and computational efficiency rather than an empirical hyperparameter choice.
This design rationale is further supported by the token sensitivity analysis in Table 3, where both reduced (2/1/1) and substantially increased (8/4/2) allocations lead to either degraded performance or only marginal gains, confirming that the default configuration achieves an effective balance between capacity and redundancy. This asymmetric token design offers a balanced trade-off between representational richness and parameter efficiency, reflecting the relative complexity and dimensionality of each signal modality. To support both the main contribution of this work and fair comparison with prior studies, we report results under two complementary task settings: (1) a fine-grained 9-class setting used as the primary configuration throughout this paper, and (2) an aligned 2/5-class setting used only for reproducing the evaluation protocol commonly adopted in prior DEAP works (e.g., DEMA) and for Table 4. A multi-task learning paradigm is adopted to jointly predict three key affective dimensions:
  • Valence, Arousal, and Liking scores, originally rated on a 9-point scale in the DEAP dataset, are treated as nine ordered categories and formulated as parallel 9-class classification tasks. Each affective dimension is modeled with an independent prediction head equipped with softmax activation. The optimization objective adopts focal cross-entropy with label smoothing to mitigate potential class imbalance and reduce over-confidence under subjective affective annotations.
  • Aligned setting for prior-work comparison (Table 4). In addition to the above 9-class formulation, we also evaluate a reduced-label configuration: valence and arousal are binarized into high vs. low (2-class) tasks following common DEAP practice, and categorical emotion is evaluated under a 5-class scheme (Q1–Q4 plus neutral). This setting is not the main task configuration of this paper and is used solely for fair comparison with existing single-task studies. In addition, an auxiliary branch is introduced for discrete affective quadrant classification:
  • Quadrants (Q1–Q4) and Neutral states are modeled as a five-dimensional multi-label classification task, trained independently from the valence–arousal–liking pipeline. This separation enables an explicit evaluation of categorical emotion distribution without mutual interference from ordinal regression targets.
  • Model performance is quantitatively evaluated using three standard metrics: Accuracy (Acc) measures the proportion of correctly classified samples among the total number of samples, reflecting overall recognition correctness. F1-score (F1) represents the harmonic mean of precision and recall, providing a balanced assessment under potential class imbalance. Precision (Prec) quantifies the ratio of true positive predictions to all positive predictions, indicating the reliability of positive classifications.
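As noted above, the following is a minimal sketch of the subject-level partition-then-segment protocol: subjects are split 8:2 first, and only afterwards are overlapping 10 s windows (stride 1 s, i.e., 128 samples at 128 Hz) cut independently inside each split. The random seed, array names, and use of a single synthetic trial are illustrative.

```python
# Sketch: subject-wise split first, then sliding-window segmentation within each split.
import numpy as np

def subject_wise_split(n_subjects=32, test_ratio=0.2, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_subjects)
    n_test = int(round(n_subjects * test_ratio))
    return order[n_test:], order[:n_test]               # train subjects, test subjects

def segment_windows(recording, win=1280, stride=128):   # recording: (C, T_total) at 128 Hz
    starts = range(0, recording.shape[1] - win + 1, stride)
    return np.stack([recording[:, s:s + win] for s in starts])   # (n_windows, C, win)

train_subj, test_subj = subject_wise_split()
trial = np.random.randn(32, 60 * 128)                   # one 60 s, 32-channel trial (synthetic)
windows = segment_windows(trial)                        # windows stay inside this subject's split
print(train_subj.shape, test_subj.shape, windows.shape) # (26,) (6,) (51, 32, 1280)
```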

4.2. Class Distribution Analysis

The DEAP dataset provides self-assessed affective ratings on a 1–9 Likert scale for each trial. In total, 32 participants completed 40 trials each, resulting in 1280 annotated labels per affective dimension. All sliding windows derived from a given trial inherit its trial-level label. Therefore, the window-level distribution strictly follows the original rating distribution and does not introduce additional imbalance.
As shown in Table 5, the distribution approximately follows a bell-shaped pattern, with mid-level ratings (4–6) occurring more frequently and extreme categories (1 and 9) less commonly. The largest class (score 5) accounts for 18% of samples, while the smallest classes (scores 1 and 9) account for 6%. The maximum imbalance ratio is therefore about 3:1, which indicates moderate but not severe class imbalance. To mitigate potential bias toward more frequent categories, we employ focal cross-entropy in training, which naturally emphasizes harder and less-confident samples. Additionally, label smoothing is applied to reduce over-confidence near ordinal boundaries. No explicit re-sampling is performed, as the observed imbalance remains within a manageable range.

4.3. Training Configuration

For the multi-task classification of valence, arousal, and liking, we use the Adam optimizer with a learning rate of $1 \times 10^{-4}$, a batch size of 32, and weight decay of $1 \times 10^{-5}$. The model is trained for 120 epochs with early stopping based on the validation loss of the valence prediction task. The total loss is a weighted sum of the three task-specific losses, with learnable weights $\lambda_{e}$, $\lambda_{a}$, and $\lambda_{p}$ initialized to $1/3$. For the Q1–Q4 + neutral classifier, we train a separate model using sigmoid activation and binary cross-entropy loss. The same optimizer and training schedule are applied.
For the aligned 2/5-class evaluation used in Table 4, we keep the same network backbone and training pipeline but replace the 9-class softmax heads with binary classification heads for valence and arousal (high vs. low). This ensures that the comparison results are produced under the same model design while matching the label space reported in prior works. All experiments are implemented in PyTorch 2.0.1 and executed on two NVIDIA RTX 4090 GPUs.
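The optimization setup above can be summarized in a short sketch: Adam with learning rate $1 \times 10^{-4}$ and weight decay $1 \times 10^{-5}$, learnable task weights optimized jointly with the network, up to 120 epochs, and early stopping on the valence validation loss. The `model`, `loss_fn`, data-loader interface, and the patience value are placeholders for the components sketched earlier, not the exact implementation.

```python
# Training-loop sketch under the stated optimizer and early-stopping configuration.
import torch

def fit(model, loss_fn, train_loader, val_valence_loss, max_epochs=120, patience=10):
    params = list(model.parameters()) + list(loss_fn.parameters())   # includes lambda_e/a/p
    optim = torch.optim.Adam(params, lr=1e-4, weight_decay=1e-5)
    best, stale = float("inf"), 0
    for epoch in range(max_epochs):
        for batch in train_loader:                                   # batch size 32 in our setup
            # model consumes the three modality tensors and returns three sets of logits
            loss = loss_fn(*model(*batch["inputs"]), *batch["labels"])
            optim.zero_grad(); loss.backward(); optim.step()
        val = val_valence_loss(model)                                # validation loss, valence head
        if val < best:
            best, stale = val, 0
        else:
            stale += 1
            if stale >= patience:                                    # early stopping
                return
```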

4.4. Comparison with Prior Works on DEAP Dataset

To benchmark our proposed framework, we compare its single-task performance with a series of state-of-the-art emotion recognition models evaluated on the DEAP dataset. Since most existing works are designed as single-task pipelines and report classification accuracy only for valence, arousal, or quadrant-based labels, we reconfigure our framework to match the category settings adopted in those studies. Specifically, in this subsection, valence and arousal are formulated as binary (high vs. low) classification tasks, and quadrant evaluation follows a five-class setting (Q1–Q4 plus neutral), consistent with works such as DEMA [12]. It should be emphasized that the results reported in Table 4 correspond to this 2-class and 5-class formulation and do not reflect the 9-class fine-grained classification setting used in our primary multi-task experiments. The 9-class setting represents the main task configuration of this paper, while the present subsection is provided solely for fair comparison with prior single-task studies. In particular, subject-wise data splitting is enforced, ensuring that physiological recordings from the same participant never appear in both training and testing sets. For fair comparison with prior DEAP studies, non-overlapping 10 s windows (stride = window length) are adopted in this subsection, as many existing single-task methods report results under trial-level or non-overlapping segmentation protocols. Aligning the windowing strategy minimizes protocol-induced performance discrepancies. It should be emphasized that this configuration is used exclusively for the comparison results in Table 4. In contrast, the primary 9-class multi-task experiments and ablation studies employ overlapping windows to improve sample efficiency and training stability. In both settings, subject-level partitioning is strictly enforced prior to window segmentation. Sliding windows are generated independently within each subject split, and all windows derived from a given subject are confined to either the training or testing partition, thereby preventing subject-level information leakage. For clarity, performance values of the compared methods are directly taken from their respective publications as reported, without re-implementation. Evaluation protocols may differ across those works. All experiments for our method in this subsection follow the same subject-independent evaluation protocol described above. Specifically, training and testing sets are split at the subject level, such that no subjects appear in both sets.
Discussion. The proposed method consistently outperforms prior models across all three affective dimensions. Under the aligned 2-class and 5-class evaluation setting, our framework achieves competitive and superior performance while maintaining strict subject-independent splitting. Specifically, as shown in Table 4, our method achieves 98.21 ± 0.29 on valence, 98.34 ± 0.31 on arousal, and 97.85 ± 0.26 on Q1–Q4 + neutral classification, where the reported values correspond to the mean and standard deviation over 10 independent subject-wise splits. Compared with DEMA [12], our model yields average improvements of + 0.66 % in valence, + 0.73 % in arousal, and + 0.84 % in Q1–Q4 plus neutral accuracy when evaluated under the same 2-class and 5-class formulation. Importantly, these results are obtained under strict cross-subject evaluation settings, where subjects in the test set are completely unseen during training. The consistently low standard deviation across repeated runs indicates that the observed performance gains are stable and not dependent on a particular random subject partition. It should be emphasized again that these results correspond to the binary and five-class configuration adopted for fair comparison with prior single-task studies. The primary contribution of this work lies in the 9-class fine-grained multi-task setting, while the present comparison demonstrates the adaptability and robustness of the proposed framework under conventional DEAP evaluation protocols.

4.5. Multi-Task Classification Results and Analysis

Table 2 reports the performance of various baseline models equipped with our proposed multi-task classification heads under a consistent training pipeline. All methods are evaluated on the DEAP dataset with a 9-class setup for valence, arousal, and liking dimensions. Unless otherwise specified, all experiments in this section follow a subject-wise 8:2 train–test split, ensuring that samples from the same subject do not appear in both training and testing sets, thereby avoiding data leakage. Since no prior work explicitly focuses on multi-task modeling of valence, arousal, and liking jointly, we include a comprehensive set of representative single-task models as baselines. These include common recurrent structures (LSTM, BiLSTM, GRU), convolutional models (1D-CNN, CNN-LSTM), and attention-based or non-deep learning models (Transformer, XGBoost). Each baseline is reconfigured to support our multi-head multi-task setting and trained under the same subject-wise split for fair comparison.
As shown in Table 2 and visualized in Figure 4, Figure 5 and Figure 6, our full model consistently achieves superior results across all affective dimensions and evaluation metrics. Specifically, our method reaches 92.8% accuracy on valence, 91.8% on arousal, and 93.6% on liking—each surpassing the closest baseline by 2–5 percentage points. F1-scores and Precision follow the same trend, highlighting not only the accuracy but also the robustness of our predictions. Compared to strong single-task models such as Transformer-based [38] or BiLSTM-based [32], our method exhibits more balanced performance across all tasks. For instance, while Transformer performs well on arousal (90.1% accuracy), it struggles on liking (84.2%), indicating limited generalization in multi-target scenarios. Similarly, BiLSTM underperforms on both arousal and liking dimensions. Tree-based models like XGBoost [36] and hybrid models such as CNN-LSTM [37], though competitive on individual metrics, show inconsistent behavior across tasks, further underscoring the necessity of joint modeling. Overall, the superior and stable performance of our method demonstrates the effectiveness of shared feature representation, enhanced by the proposed DTFE module and cross-modal fusion strategy. These results validate that our approach not only outperforms all baselines but also provides a scalable and unified solution for multi-dimensional affective modeling with strong generalization ability.
To further strengthen the comparison with recent EEG-specific emotion recognition architectures, we additionally include three state-of-the-art models published in 2024–2025: EmT [18], MASA-TCN [19], and AT-DGNN [20]. EmT is a Transformer-based framework designed for generalized cross-subject EEG emotion recognition, incorporating global attention mechanisms to enhance subject-invariant representation learning. MASA-TCN introduces a multi-anchor space-aware temporal convolutional structure that captures spatial–temporal EEG dependencies through anchor-based convolution. AT-DGNN leverages dynamic graph neural networks combined with temporal attention to model inter-channel relationships and music-driven emotional interactions. For EmT, MASA-TCN, and AT-DGNN, the original task-specific classification heads are replaced with our shared multi-task prediction heads, while preserving their backbone architectures and feature extraction modules. All models are trained under identical data splits, preprocessing pipelines, and optimization strategies to ensure a fair and controlled evaluation.
Compared with recent EEG-specialized architectures, EmT achieves strong cross-subject performance owing to its Transformer-based global modeling. MASA-TCN and AT-DGNN also demonstrate competitive results, highlighting the effectiveness of spatial-temporal convolution and graph-based interaction modeling. However, these models were originally designed for single-task emotion recognition. When adapted to the unified multi-task setting, their performance shows moderate degradation, particularly on the liking dimension. In contrast, the proposed framework maintains consistently balanced and superior performance across all three affective dimensions, demonstrating the advantage of structured shared representation learning combined with hierarchical cross-modal fusion. In addition to performance comparison, Table 2 also reports the estimated parameter size, computational complexity (FLOPs), and total training time on two 4090 GPUs. It can be observed that traditional recurrent models such as LSTM and GRU maintain moderate parameter sizes (1.38 M–1.62 M) and relatively low FLOPs (0.20 G–0.24 G), whereas Transformer-based models exhibit the highest computational cost (3.78 M parameters and 1.05 G FLOPs). The proposed method contains 2.86 M parameters and requires 0.82 G FLOPs, which is lower than the Transformer and CNN-LSTM baselines while achieving substantially superior multi-task performance. This indicates that the proposed DTFE and hierarchical fusion modules introduce only a moderate increase in computational complexity yet yield significant gains in accuracy across all affective dimensions. Therefore, the performance improvement does not stem from excessive model scaling but rather from more effective structured representation learning.
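The parameter counts in Table 2 can in principle be reproduced by counting trainable tensors; the helper below is a generic sketch (FLOPs and wall-clock training time additionally depend on input shape and profiling tooling, which are not shown here).

```python
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Number of trainable parameters in millions, matching the
    'Params (M)' convention used in Table 2."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```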
To further investigate the class-wise prediction behavior of the proposed multi-task framework, the normalized confusion matrices for the 9-class Valence, Arousal, and Liking tasks are illustrated in Table 6, Table 7, and Table 8, respectively. Overall, the majority of samples are concentrated along the main diagonal, which is consistent with the high overall accuracies reported in Table 2 (92.8%, 91.8%, and 93.6% for Valence, Arousal, and Liking, respectively). Notably, classes C4–C6 exhibit relatively larger off-diagonal distributions, indicating increased confusion among intermediate emotional intensity levels. This phenomenon is reasonable given that adjacent emotional states in the valence–arousal space tend to share overlapping physiological patterns. In contrast, extreme classes (e.g., C1–C3 and C8–C9) demonstrate more compact diagonal dominance, suggesting that the proposed model can effectively capture distinctive affective signatures in both low- and high-intensity regions. These observations further validate the robustness and discriminative capability of the proposed multi-modal multi-task framework.
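The row-normalized matrices in Tables 6–8 can be obtained as sketched below, assuming integer class labels 0–8; this is a generic recipe rather than the exact evaluation script used in this work.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def row_normalized_confusion(y_true, y_pred, n_classes=9):
    """Confusion matrix in percent, normalized so each true-class row
    sums to ~100, as reported in Tables 6-8."""
    cm = confusion_matrix(y_true, y_pred, labels=np.arange(n_classes))
    return 100.0 * cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
```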

4.6. Ordinal Structure Analysis and Modeling Justification

The affective ratings in the DEAP dataset are provided on a discrete 1–9 Likert scale, which inherently possesses ordinal properties: the categories follow a natural order, and adjacent scores are semantically closer than distant ones. To examine whether the proposed framework respects this ordinal structure, we provide additional structural and quantitative evaluations. First, inspection of the normalized confusion matrices reveals that most misclassifications occur between adjacent categories, while long-range errors spanning multiple rating levels are rare. For example, samples labeled with score 4 are predominantly misclassified as 3 or 5, whereas the extreme categories (1 and 9) are almost never confused with each other. This structured error pattern indicates that the model does not behave as an unordered flat classifier but instead implicitly captures the ordinal relationships among affective levels. Second, we compute the Mean Absolute Error (MAE) between predicted and true rating levels as an ordinal consistency metric. Under the 9-class setting, the MAE values for Valence, Arousal, and Liking are 0.43, 0.46, and 0.41, respectively, substantially lower than the expected error of random prediction (approximately 2.67). These results further demonstrate that the model maintains strong ordinal consistency in terms of numerical distance between predicted and true labels. Regarding the modeling choice, most existing DEAP-based emotion recognition studies adopt a standard multi-class classification framework rather than explicit ordinal regression. For instance, the recent DEMA model [12] reports 97.55% (valence) and 97.61% (arousal) using a softmax-based classification strategy without a dedicated ordinal regression mechanism, reflecting a widely accepted practice in the field. Moreover, affective self-assessments inherently involve subjective uncertainty and blurred class boundaries, so strict ordinal regression assumptions (e.g., fixed inter-class distances) may not fully hold. In contrast, a well-optimized multi-class framework, combined with focal loss and label smoothing, can naturally learn relative category relationships while maintaining stable training dynamics. In summary, although an explicit ordinal regression model is not adopted, both the confusion matrix analysis and the MAE evaluation confirm that the proposed framework effectively captures the ordinal structure of the affective ratings, so the multi-class formulation remains a well-justified modeling choice for this task.
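For reference, the ordinal-consistency metric used above reduces to the mean absolute distance between predicted and true rating levels; a minimal sketch, assuming labels encoded as integers 1–9, is:

```python
import numpy as np

def ordinal_mae(y_true, y_pred):
    """Mean absolute error between predicted and true 9-point rating
    levels, used here as an ordinal consistency measure."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

# Example: ordinal_mae([4, 9, 1], [5, 8, 1]) == 2/3
```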

4.7. Statistical Performance Analysis

Table 9 reports the multi-task classification performance under the 9-class setup using mean ± standard deviation over 10 independent subject-wise splits. This statistical reporting provides a more reliable evaluation of cross-subject generalization and avoids potential bias from single best-run results.
Across all affective dimensions, the proposed method consistently achieves the highest mean accuracy, F1-score, and precision.
Specifically, our framework obtains 92.47% ± 0.29 (Valence), 91.42% ± 0.33 (Arousal), and 93.08% ± 0.26 (Liking) in terms of accuracy, which clearly surpasses all baseline models. Importantly, the relatively small standard deviations (all below 0.35) demonstrate strong stability across different subject partitions, indicating robust cross-subject generalization rather than performance reliance on a particular split.
Compared with recurrent models (LSTM, GRU, BiLSTM), our method improves accuracy by approximately 5–8 percentage points across tasks. In addition to conventional sequence models, we also compare with recent EEG-specific architectures, including EmT [18], MASA-TCN [19], and AT-DGNN [20]. EmT adopts a transformer-based design to enhance cross-subject representation learning, MASA-TCN leverages multi-anchor temporal convolution for spatial–temporal EEG modeling, and AT-DGNN employs graph neural networks with attention mechanisms to capture dynamic inter-channel dependencies. Although these advanced architectures achieve competitive results on certain dimensions (e.g., Arousal), their overall performance remains inferior to the proposed structured multi-modal modeling framework under the unified multi-task setting. Overall, the results confirm that the proposed framework achieves superior multi-dimensional affective modeling with strong statistical robustness under strict subject-independent evaluation.
  • LOSO cross-validation and windowing effect analysis.
To further validate the robustness of cross-subject generalization, we conduct additional leave-one-subject-out (LOSO) cross-validation for the proposed framework under the primary 9-class multi-task setting. In this protocol, data from one participant are held out for testing, while the remaining 31 participants are used for training. The procedure is repeated for all 32 subjects. We report the mean ± standard deviation (STD) across the 32 LOSO folds, together with the 95% confidence interval (CI), computed as CI = 1.96 · STD / √32. The quantitative results of the LOSO cross-validation are summarized in Table 10.
As observed, LOSO results remain consistent with those obtained under repeated subject-wise random splits (Table 2) while exhibiting higher variance as expected since each fold tests on a single unseen subject. Importantly, the 95% CIs are narrow (within ±0.8%), indicating that the overall performance level is stable and not driven by a particular subject partition.
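The fold-level aggregation behind Table 10 follows directly from the CI definition above; a small sketch, assuming the sample standard deviation is used, is:

```python
import numpy as np

def loso_summary(fold_scores):
    """Mean, STD, and 95% CI half-width over LOSO folds:
    CI = 1.96 * STD / sqrt(n_folds)."""
    scores = np.asarray(fold_scores, dtype=float)
    mean, std = scores.mean(), scores.std(ddof=1)
    half_ci = 1.96 * std / np.sqrt(len(scores))
    return mean, std, half_ci
```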
  • Statistical Significance Test.
Although mean ± standard deviation provides an indication of performance stability, we further conducted paired statistical significance tests to more rigorously assess the reliability of the observed performance differences. Specifically, for each affective dimension (Valence, Arousal, and Liking), paired t-tests were performed between the proposed method and the strongest baseline (EmT) across the same 10 subject-wise splits. Since all models were evaluated under identical data partitions, the paired test is appropriate for quantifying performance differences under matched experimental conditions. The results show that the proposed method exhibits statistically significant improvements over EmT in all three tasks (Valence: p < 0.001, Arousal: p < 0.01, Liking: p < 0.001). These findings suggest that the observed gains are unlikely to be attributed solely to random variation across splits. Overall, the statistical analysis provides additional evidence supporting the stability and cross-subject generalization capability of the proposed multi-task framework.
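The paired comparison described above corresponds to the following standard recipe, sketched with SciPy under the assumption that per-split accuracies are available for both methods:

```python
from scipy.stats import ttest_rel

def paired_significance(ours_per_split, baseline_per_split):
    """Paired t-test over matched subject-wise splits
    (e.g., the proposed method vs. EmT on the same 10 splits)."""
    t_stat, p_value = ttest_rel(ours_per_split, baseline_per_split)
    return t_stat, p_value
```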

4.8. Robustness Analysis Under Different Train–Test Splits

To further evaluate the robustness and generalization ability of the proposed framework, we conduct additional experiments under more challenging train–test splits. Besides the default 8:2 setting, we consider subject-wise 7:3 and 6:4 splits, where fewer training samples are available and the risk of overfitting is more pronounced. All splits are performed in a subject-independent manner to strictly avoid data leakage. Table 11 and Table 12 report the multi-task classification results under the 7:3 and 6:4 splits, respectively. As expected, reducing the proportion of training data leads to a consistent performance degradation across all baseline methods and affective dimensions. This trend is particularly evident for recurrent and convolutional baselines, which exhibit noticeable drops in both accuracy and F1-score as the training data becomes more limited. Despite the increased difficulty, our proposed model maintains a clear performance advantage under both settings. Under the 7:3 split, our method achieves 91.7%, 90.6%, and 92.4% accuracy on valence, arousal, and liking, respectively, consistently outperforming the strongest baselines by a clear margin. When the split is further reduced to 6:4, the performance of all methods decreases more substantially; nevertheless, our framework still attains 90.2% accuracy on valence, 88.9% on arousal, and 90.9% on liking, remaining the best-performing approach across all three affective dimensions. Notably, compared with Transformer-based and recurrent models, our method exhibits a slower performance degradation as the amount of training data decreases. This observation indicates that the proposed Dynamic Token Feature Extractor (DTFE) and the hierarchical cross-modal fusion strategy enable more data-efficient representation learning by effectively capturing complementary temporal and frequency-domain cues across heterogeneous physiological signals. Overall, these results demonstrate that the proposed framework is not only effective under standard evaluation protocols, but also robust to reduced training data, highlighting its strong generalization capability in realistic, data-constrained affective computing scenarios.

4.9. Detailed 9-Class Precision Analysis

To provide a fine-grained evaluation of the proposed multi-task framework, Table 13 reports the detailed precision scores for each of the nine ordinal classes across Valence, Arousal, and Liking. As shown in Table 13, the proposed method achieves consistently high precision across all nine categories. For Valence, Arousal, and Liking, the macro precision reaches 90.2%, 89.9%, and 89.4%, respectively. The performance remains stable across ordinal levels, including extreme classes (Class 1 and Class 9), indicating balanced fine-grained discrimination rather than improvement limited to specific categories. These results further confirm the robustness and consistency of the multi-task modeling strategy under subject-independent evaluation.

4.10. Ablation Study on DTFE and Fusion Modules

To further assess the contribution of each key component, we conduct ablation experiments by selectively removing the Dynamic Token Feature Extractor (DTFE) and the Cross-Modal Interaction and Fusion module. As shown in Table 14, removing the DTFE modules (w/o DTFE) significantly degrades performance across all metrics, with valence accuracy dropping from 92.8% to 88.5%. This confirms the importance of DTFE’s token-based temporal-frequency modeling in capturing emotional patterns. Similarly, disabling the Cross-Modal Fusion module (w/o Cross-Modal Fusion) leads to a consistent drop in performance, particularly on the liking dimension (from 93.6% to 89.1%), suggesting that cross-modal alignment is critical for modeling subjective preference responses. In both cases, our full model consistently outperforms the ablated variants, validating the necessity of both DTFE and hierarchical fusion for effective multi-modal emotion and preference recognition.

4.11. Ablation Study: Learnable vs. Fixed Loss Weights

To evaluate the effectiveness of our learnable loss weight mechanism, we compare the performance of models trained with fixed weights λ_e = λ_a = λ_p = 1/3 and models trained with learnable weights initialized to the same values and optimized jointly with the main network. Both settings share the same architecture and training schedule, isolating the impact of dynamic loss weighting.
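One plausible instantiation of such learnable weighting, given only as a hedged sketch (the exact parameterization in our implementation may differ), keeps the weights positive and summing to one via a softmax over learnable logits, so that zero-initialized logits reproduce the fixed 1/3 setting:

```python
import torch
import torch.nn as nn

class LearnableTaskWeights(nn.Module):
    """Learnable multi-task loss weights; the softmax of zero-initialized
    logits starts at 1/3 per task and adapts during training."""
    def __init__(self, n_tasks: int = 3):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, task_losses):
        w = torch.softmax(self.logits, dim=0)
        return torch.sum(w * torch.stack(task_losses))
```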
  • Analysis. As shown in Table 15, enabling the loss weights to be learnable significantly improves classification performance across all three affective dimensions. Compared to fixed equal weights, the dynamic weighting mechanism allows the model to adaptively prioritize more difficult or underperforming tasks during training. This leads to more balanced optimization and consistent gains in accuracy, F1-score, and precision. We conclude that learnable task weights are essential for achieving optimal performance in multi-task affective modeling.

4.12. Disentangling Modality Scaling and Structural Contributions

To clarify whether the observed performance gains stem from increased modality coverage or from the proposed architectural design, we conduct two complementary analyses: (i) modality scaling versus structural module contribution, and (ii) fine-grained ablation of key design components. (1) Modality Scaling vs. Module Gain. As shown in Table 16, we introduce a simple multi-task baseline, denoted as MLP-MT, which adopts a shared-backbone MLP with three task-specific heads under the same training protocol. Within this controlled setting, progressively adding modalities (EEG → EEG + Peripheral → EEG + Peripheral + EOG) consistently improves performance from 83.6% to 86.8% in average accuracy. This confirms that richer physiological modality coverage contributes positively to affective recognition. However, under the identical full-modality input (EEG + Peripheral + EOG), our full model substantially outperforms MLP-MT, achieving 92.7% average accuracy compared to 86.8%. This 5.9% absolute improvement demonstrates that the performance gain cannot be attributed solely to modality scaling but is primarily driven by the proposed structured representation learning framework. Notably, this improvement is achieved with moderate computational cost, indicating a favorable accuracy–efficiency trade-off.
(2) Fine-Grained Design Analysis. To further identify the contribution of individual components, Table 3 reports detailed ablation results under identical modality and training settings. First, removing either the Cross-Series Intersection (CSI) or the Intra-Series Intersection (ISI) branch leads to consistent degradation. In particular, removing ISI causes a larger drop, highlighting the importance of frequency-aware modeling for fine-grained affect discrimination. Removing both branches results in the most pronounced decline, confirming their complementary roles. Second, replacing the learnable tokenization matrix Q_i with fixed average pooling reduces performance, while a frozen random Q_i only partially recovers it. This indicates that adaptive token aggregation is crucial and that the improvement is not merely due to dimensionality reduction.
Third, varying the number of tokens per modality reveals that too few tokens degrade representation capacity, whereas increasing token numbers beyond the default setting provides only marginal benefit. The default asymmetric configuration (4/2/1 tokens for EEG, Peripheral, and EOG, respectively) is motivated by modality complexity and dimensionality. EEG contains rich spatial–temporal dynamics across 32 channels, thus requiring higher representational capacity, whereas peripheral signals are heterogeneous but lower-dimensional, and EOG contains only two channels with relatively simpler dynamics. As shown in Table 3, reducing token numbers (2/1/1) degrades representation capacity, while substantially increasing tokens (8/4/2) introduces additional computational cost with only marginal performance gain.
In addition, we compare single-task training (three independent models) with the unified multi-task formulation. As shown in Table 3, single-task training achieves slightly higher accuracy on Valence (92.9%) and Arousal (92.0%), while the multi-task model performs better on Liking (93.6% vs. 92.5%). Despite these minor differences, the unified multi-task framework maintains competitive performance across all dimensions while enabling shared representation learning and a single deployment model. This indicates that cross-task interaction introduces complementary affective cues without causing substantial negative interference.
Importantly, the additional computational overhead introduced by CSI and ISI remains moderate. Compared with the token-only backbone, the structured modules increase parameters and FLOPs only marginally while yielding consistent accuracy gains, indicating a favorable performance–efficiency trade-off.
Regarding the token configuration, these results suggest that the default setting achieves a balanced trade-off between expressiveness and redundancy rather than being the outcome of an exhaustive hyperparameter search. Fourth, among the normalization strategies considered, Instance Normalization (IN) achieves the best performance, likely because it better accommodates subject-specific variability in physiological signals than BN or LN. Finally, to examine whether ISI genuinely learns adaptive spectral selectivity, we compare the learnable frequency gating mechanism with fixed band-pass filtering and with a variant without gating. Although fixed band-pass filtering provides a moderate improvement, it remains inferior to learnable frequency gating, confirming that ISI captures task-adaptive frequency patterns beyond handcrafted spectral priors; a minimal sketch of this gating operation is given below. Overall, the combined results in Table 3 and Table 16 demonstrate that the proposed framework derives its advantage from structured cross-channel interaction, adaptive token formation, and learnable frequency-aware modeling rather than merely from increasing the number of input modalities.
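To make the last point concrete, the following is a minimal sketch of learnable frequency-domain gating in the spirit of ISI; the real-valued gate and the single-tensor interface are simplifying assumptions, and the actual module may use a richer (e.g., channel-wise or complex-valued) parameterization.

```python
import torch
import torch.nn as nn

class FrequencyGate(nn.Module):
    """FFT along time, element-wise modulation by learnable weights,
    inverse FFT back to the time domain (ISI-style sketch)."""
    def __init__(self, seq_len: int):
        super().__init__()
        # one learnable weight per rFFT frequency bin
        self.q_f = nn.Parameter(torch.ones(seq_len // 2 + 1))

    def forward(self, x):                  # x: (batch, channels, seq_len)
        spec = torch.fft.rfft(x, dim=-1)   # complex spectrum
        spec = spec * self.q_f             # learnable band-wise gating
        return torch.fft.irfft(spec, n=x.shape[-1], dim=-1)
```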

4.13. Window Overlap Effect Analysis

To examine whether overlapping windows artificially inflate performance, we compare the proposed framework under two segmentation strategies: (i) overlapping windows (10 s length, stride = 1 s) and (ii) non-overlapping windows (10 s length, stride = 10 s). Subject-level partitioning is strictly performed before window segmentation, ensuring that no subject appears in both training and testing sets. All other settings remain identical.
As shown in Table 17, switching to non-overlapping windows leads to a moderate performance drop of approximately 1–1.5% across all affective dimensions. This decline is expected due to reduced sample density, yet the overall performance remains consistently high. These results indicate that the reported gains are not driven by overlap-induced sample inflation but arise from the proposed structured multi-modal representation design.
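For completeness, the two protocols differ only in the stride passed to the window generator; a minimal sketch, assuming the 128 Hz down-sampled DEAP rate (an assumption of this example), is:

```python
import numpy as np

def sliding_windows(signal, fs=128, win_sec=10, stride_sec=1):
    """Segment a (channels, samples) recording into fixed-length windows.
    stride_sec=1 gives the overlapping protocol; stride_sec=10 gives the
    non-overlapping protocol compared in Table 17."""
    win, stride = int(win_sec * fs), int(stride_sec * fs)
    starts = range(0, signal.shape[-1] - win + 1, stride)
    return np.stack([signal[..., s:s + win] for s in starts])
```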

5. Potential Applications and Limitations

While the proposed framework demonstrates strong performance on the DEAP benchmark dataset, we emphasize that the current study is limited to offline evaluation under controlled experimental conditions. No clinical data, real-time deployment experiments, or cross-dataset validations were conducted in this work. Therefore, the following application scenarios should be interpreted as potential research directions rather than validated clinical solutions.
  • Prospective Therapeutic Monitoring: The model may provide a quantitative foundation for analyzing trends in valence, arousal, and liking signals during music exposure. In future clinical settings, such physiological-based prediction frameworks could potentially assist clinicians in assessing emotional responses during therapy sessions. However, clinical feasibility and reliability would require dedicated validation on patient cohorts.
  • Adaptive Affective Computing Systems: Given its multi-modal fusion capability, the proposed framework could be extended toward affective-aware music recommendation or interactive systems. Nevertheless, real-time deployment would require latency optimization, hardware integration, and robustness testing under unconstrained environments.
  • Biofeedback-Driven Interfaces: The architecture may serve as a computational backbone for biofeedback-driven VR or immersive environments. Future work should investigate real-time physiological acquisition, noise robustness, and user-dependent calibration mechanisms.
  • Clinical Decision Support (Long-Term Vision): With longitudinal physiological recordings and validated clinical protocols, multi-task affective models could eventually contribute to clinical dashboards for emotional trend monitoring. Such applications, however, remain beyond the scope of the present DEAP-based study.
In summary, although the proposed framework exhibits promising discriminative capability on benchmark data, translating it into clinical or real-time systems requires substantial additional validation, including cross-dataset generalization, prospective clinical trials, and hardware-level deployment studies.

6. Conclusions

In this work, we proposed a structured multi-modal framework for joint emotion and preference recognition based on heterogeneous physiological signals. By introducing explicitly factorized representation learning and hierarchical cross-modal interaction, the proposed approach enables coherent modeling of temporal–spectral dynamics and inter-modality dependencies within a unified multi-task setting.
Extensive experiments on the DEAP benchmark demonstrate consistent improvements over strong baselines under subject-independent evaluation. The statistical analysis, ablation studies, and robustness evaluations collectively confirm the effectiveness and stability of the proposed structured modeling strategy.
Beyond empirical gains, this study highlights the importance of aligning architectural design with the intrinsic organizational properties of physiological signals. The results suggest that explicitly structured token-based modeling can serve as a principled approach for scalable multi-modal affect representation learning.
Future work will investigate cross-dataset generalization, longitudinal affect modeling, and validation under real-world constraints to further assess the applicability of structured multi-modal learning in affective computing. In parallel, a complementary theoretical research direction will focus on the formal robustness analysis of multi-modal physiological learning systems. Specifically, cross-modal interaction structures may be modeled as weighted interaction graphs, whose connectivity properties can be analyzed under modality-level perturbations or sensor failures. Building on this perspective, future studies will explore whether connectivity-based metrics can be defined for learned multi-modal fusion graphs, where nodes correspond to modality tokens and edges represent learned interaction strengths. Such a formulation may enable quantitative evaluation of robustness under modality dropout, adversarial sensor degradation, or partial signal corruption, thereby providing a potential bridge between graph-theoretic resilience analysis and data-driven representation learning in affective computing.
However, several limitations must be acknowledged when considering the transition from benchmark evaluation to real-world clinical deployment. First, the DEAP dataset is collected under controlled laboratory conditions with healthy participants, fixed stimulus durations, and standardized recording equipment. In contrast, clinical environments—such as music therapy for affective disorders—often involve heterogeneous patient populations, variable session lengths, motion artifacts, sensor displacement, and incomplete modality availability. The current study does not evaluate robustness under such real-world acquisition variability. Second, all experiments are conducted in an offline setting with pre-segmented windows and full-session recordings. Real-time therapeutic monitoring would require continuous streaming inference, latency-constrained processing, and adaptive calibration mechanisms to account for subject drift and physiological non-stationarity over longer treatment periods. These practical deployment constraints remain untested. Third, no cross-dataset validation or prospective clinical study is performed in this work. Although subject-independent splits reduce identity leakage, they do not fully reflect domain shift across acquisition devices, cultural backgrounds, or patient groups. The absence of external validation limits the immediate clinical generalizability of the proposed framework. To bridge this gap, future validation should proceed along several concrete directions: (1) cross-dataset evaluation across multiple physiological emotion benchmarks to quantify domain robustness; (2) pilot studies involving clinical populations to assess performance under realistic therapeutic settings; (3) real-time system prototyping with streaming data pipelines and latency profiling; and (4) systematic robustness testing under modality dropout, sensor noise, and motion-induced artifacts.

Author Contributions

Conceptualization, W.Q. and M.-J.-S.W.; methodology, W.Q.; software, W.Q.; validation, W.Q. and M.-J.-S.W.; formal analysis, W.Q.; investigation, W.Q.; resources, M.-J.-S.W.; data curation, W.Q.; writing—original draft preparation, W.Q.; writing—review and editing, M.-J.-S.W.; visualization, W.Q.; supervision, M.-J.-S.W.; project administration, M.-J.-S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the authors.

Data Availability Statement

The dataset used in this study is publicly available at: https://github.com/athar70/MLSTM (accessed on 23 June 2025).

Acknowledgments

The authors would like to thank the administrative and technical staff of the College of Arts Management, Shandong University of Arts, and the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, for their support during this study. During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-5.2) for language polishing and formatting assistance. The authors have reviewed and edited the generated content and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, Y.; Wu, Q.M.J.; Zheng, W.L.; Lu, B.L. EEG-Based Emotion Recognition Using Hierarchical Network with Subnetwork Nodes. IEEE Trans. Cogn. Dev. Syst. 2018, 10, 408–419. [Google Scholar] [CrossRef]
  2. Wang, X.W.; Nie, D.; Lu, B.L. Emotional State Classification from EEG Data Using Machine Learning Approach. Neurocomputing 2014, 129, 94–106. [Google Scholar] [CrossRef]
  3. Li, C.; Xu, C.; Feng, Z. Analysis of Physiological Signals for Emotion Recognition with the IRS Model. Neurocomputing 2016, 178, 103–111. [Google Scholar] [CrossRef]
  4. Verma, G.K.; Tiwary, U.S. Multimodal Fusion Framework: A Multiresolution Approach for Emotion Classification and Recognition from Physiological Signals. NeuroImage 2014, 102, 162–172. [Google Scholar] [CrossRef]
  5. Liu, W.; Zheng, W.L.; Lu, B.L. Emotion Recognition Using Multimodal Deep Learning. In Proceedings of the 23rd International Conference on Neural Information Processing (ICONIP), Kyoto, Japan; Springer: Cham, Switzerland, 2016; pp. 521–529. [Google Scholar]
  6. Yin, Z.; Zhao, M.; Wang, Y.; Yang, J.; Zhang, J. Recognition of Emotions Using Multimodal Physiological Signals and an Ensemble Deep Learning Model. Comput. Methods Programs Biomed. 2017, 140, 93–110. [Google Scholar] [CrossRef]
  7. Ma, J.; Tang, H.; Zheng, W.L.; Lu, B.L. Emotion Recognition Using Multimodal Residual LSTM Network. In Proceedings of the 27th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2019; pp. 176–183. [Google Scholar]
  8. Hassan, M.M.; Alam, M.G.R.; Uddin, M.Z.; Huda, S.; Almogren, A.; Fortino, G. Human Emotion Recognition Using Deep Belief Network Architecture. Inf. Fusion 2019, 51, 10–18. [Google Scholar] [CrossRef]
  9. Liu, W.; Qiu, J.L.; Zheng, W.L.; Lu, B.L. Multimodal Emotion Recognition Using Deep Canonical Correlation Analysis. arXiv 2019, arXiv:1908.05349. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Cheng, C.; Zhang, Y. Multimodal Emotion Recognition Based on Manifold Learning and Convolution Neural Network. Multimed. Tools Appl. 2022, 81, 33253–33268. [Google Scholar] [CrossRef]
  11. Tang, J.; Ma, Z.; Gan, K.; Zhang, J.; Yin, Z. Hierarchical Multimodal Fusion of Physiological Signals for Emotion Recognition with Scenario Adaption and Contrastive Alignment. Inf. Fusion 2024, 103, 102129. [Google Scholar] [CrossRef]
  12. Li, Q.; Jin, D.; Huang, J.; Zhong, Q.; Xu, L.; Lin, J.; Jiang, D. DEMA: Deep EEG-first multi-physiological affect model for emotion recognition. Biomed. Signal Process. Control 2025, 99, 106812. [Google Scholar] [CrossRef]
  13. Lan, Z.; Sourina, O.; Wang, L.; Liu, Y. Real-Time EEG-Based Emotion Monitoring Using Stable Features. Vis. Comput. 2016, 32, 347–358. [Google Scholar] [CrossRef]
  14. Li, Y.; Zheng, W.; Cui, Z.; Zong, Y.; Ge, S. EEG Emotion Recognition Based on Graph Regularized Sparse Linear Regression. Neural Process. Lett. 2019, 49, 555–571. [Google Scholar] [CrossRef]
  15. Cheng, J.; Chen, M.; Li, C.; Liu, Y.; Song, R.; Liu, A.; Chen, X. Emotion Recognition from Multi-Channel EEG via Deep Forest. IEEE J. Biomed. Health Inform. 2020, 25, 453–464. [Google Scholar] [CrossRef] [PubMed]
  16. Song, T.; Zheng, W.; Song, P.; Cui, Z. EEG Emotion Recognition Using Dynamical Graph Convolutional Neural Networks. IEEE Trans. Affect. Comput. 2020, 11, 532–541. [Google Scholar] [CrossRef]
  17. Zhang, Y.; Liu, H.; Zhang, D.; Chen, X.; Qin, T.; Zheng, Q. EEG-Based Emotion Recognition with Emotion Localization via Hierarchical Self-Attention. IEEE Trans. Affect. Comput. 2023, 14, 2458–2469. [Google Scholar] [CrossRef]
  18. Ding, Y.; Tong, C.; Zhang, S.; Jiang, M.; Li, Y.; Lim, K.J.; Guan, C. EmT: A Novel Transformer for Generalized Cross-Subject EEG Emotion Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 10381–10393. [Google Scholar] [CrossRef]
  19. Ding, Y.; Zhang, S.; Tang, C.; Guan, C. MASA-TCN: Multi-Anchor Space-Aware Temporal Convolutional Neural Networks for Continuous and Discrete EEG Emotion Recognition. IEEE J. Biomed. Health Inform. 2024, 28, 3953–3964. [Google Scholar] [CrossRef]
  20. Xiao, M.; Zhu, Z.; Xie, K.; Jiang, B. MEEG and AT-DGNN: Improving EEG Emotion Recognition with Music Introducing and Graph-Based Learning. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); IEEE: Piscataway, NJ, USA, 2024; pp. 4201–4208. [Google Scholar]
  21. Liu, S.; Zhao, Y.; An, Y.; Zhao, J.; Wang, S.H.; Yan, J. GLFANet: A Global to Local Feature Aggregation Network for EEG Emotion Recognition. Biomed. Signal Process. Control 2023, 85, 104799. [Google Scholar] [CrossRef]
  22. Jin, H.; Gao, Y.; Wang, T.; Gao, P. DAST: A Domain-Adaptive Learning Combining Spatio-Temporal Dynamic Attention for Electroencephalography Emotion Recognition. IEEE J. Biomed. Health Inform. 2023, 28, 2512–2523. [Google Scholar] [CrossRef] [PubMed]
  23. Huang, H.; Xie, Q.; Pan, J.; He, Y.; Wen, Z.; Yu, R.; Li, Y. An EEG-Based Brain–Computer Interface for Emotion Recognition and Its Application in Patients with Disorder of Consciousness. IEEE Trans. Affect. Comput. 2019, 12, 832–842. [Google Scholar] [CrossRef]
  24. Gu, X.; Cao, Z.; Jolfaei, A.; Xu, P.; Wu, D.; Jung, T.P.; Lin, C.T. EEG-Based Brain–Computer Interfaces (BCIs): A Survey of Recent Studies on Signal Sensing Technologies and Computational Intelligence Approaches and Their Applications. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 18, 1645–1666. [Google Scholar] [CrossRef]
  25. Kandemir, M.; Vetek, A.; Gönen, M.; Klami, A.; Kaski, S. Multi-Task and Multi-View Learning of User State. Neurocomputing 2014, 139, 97–106. [Google Scholar] [CrossRef]
  26. Tang, H.; Liu, W.; Zheng, W.L.; Lu, B.L. Multimodal Emotion Recognition Using Deep Neural Networks. In Proceedings of the 24th International Conference on Neural Information Processing (ICONIP), Guangzhou, China; Springer: Cham, Switzerland, 2017; pp. 811–819. [Google Scholar]
  27. Kim, B.H.; Jo, S. Deep Physiological Affect Network for the Recognition of Human Emotions. IEEE Trans. Affect. Comput. 2020, 11, 230–243. [Google Scholar] [CrossRef]
  28. Kusumaningrum, T.D.; Faqih, A.; Kusumoputro, B. Emotion Recognition Based on DEAP Database Using EEG Time-Frequency Features and Machine Learning Methods. J. Phys. Conf. Ser. 2020, 1501, 012020. [Google Scholar] [CrossRef]
  29. Zhang, X.; Liu, J.; Shen, J.; Li, S.; Hou, K.; Hu, B.; Gao, J.; Zhang, T. Emotion Recognition from Multimodal Physiological Signals Using a Regularized Deep Fusion of Kernel Machine. IEEE Trans. Cybern. 2021, 51, 4386–4399. [Google Scholar] [CrossRef]
  30. Liu, J.w.; Yang, D.; Feng, T.w.; Fu, J.j. MDFD2-DETR: A Real-Time Complex Road Object Detection Model Based on Multi-Domain Feature Decomposition and De-Redundancy. IEEE Trans. Intell. Veh. 2025, 10, 4343–4359. [Google Scholar] [CrossRef]
  31. Gers, F.A.; Schmidhuber, J.; Cummins, F. Learning to Forget: Continual Prediction with LSTM. Neural Comput. 2000, 12, 2451–2471. [Google Scholar] [CrossRef]
  32. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The Performance of LSTM and BiLSTM in Forecasting Time Series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA; IEEE: Piscataway, NJ, USA, 2019; pp. 3285–3292. [Google Scholar] [CrossRef]
  33. Rana, R. Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech. arXiv 2016, arXiv:1612.07778. [Google Scholar] [CrossRef]
  34. Pinkus, A. Approximation theory of the MLP model in neural networks. Acta Numer. 1999, 8, 143–195. [Google Scholar] [CrossRef]
  35. Tang, W.; Long, G.; Liu, L.; Zhou, T.; Jiang, J.; Blumenstein, M. Rethinking 1D-CNN for Time Series Classification: A Stronger Baseline. arXiv 2020, arXiv:2002.10061. [Google Scholar]
  36. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
  37. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in Transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Sydney, Australia, 2017; Volume 30. [Google Scholar]
  39. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 2980–2988. [Google Scholar]
  40. Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A Database for Emotion Analysis Using Physiological Signals. IEEE Trans. Affect. Comput. 2012, 3, 18–31. [Google Scholar] [CrossRef]
Figure 1. Structured multi-modal physiological modeling framework. (A) Real data acquisition and preprocessing pipeline, including signal segmentation, normalization, and down-sampling. (B) Conceptual structured multi-task modeling architecture with explicit cross-series interaction (CSI) and intra-series temporal–spectral interaction (ISI) modules.
Figure 2. Architecture of the Dynamic Token Feature Extractor (DTFE). DTFE processes input physiological signals via instance normalization, multi-dimensional projection, and token-based representation learning. The Cross-Series Intersection (CSI) module (highlighted in blue) models inter-channel token interactions, while the Intra-Series Intersection (ISI) module (highlighted in orange) captures intra-series temporal–spectral dynamics. The frequency gate performs element-wise spectral gating in the frequency domain, where Z_f denotes learnable frequency weights for band-wise modulation. Learnable tokens Q_i, Q_o, Q_w, and Q_f enable adaptive feature extraction across heterogeneous signal types.
Figure 3. Cross-modal interaction and fusion architecture integrating EEG, peripheral physiological, and eye movement features.
Figure 4. Performance comparison across different models on the Valence dimension under the 9-class setup. Accuracy, F1-score, and Precision are reported.
Figure 5. Performance comparison across different models on the Arousal dimension under the 9-class setup. Accuracy, F1-score, and Precision are reported.
Figure 6. Performance comparison across different models on the Liking dimension under the 9-class setup. Accuracy, F1-score, and Precision are reported.
Table 1. Comparison of recent multi-physiological emotion recognition methods in terms of task structure, emotion categories, and input modalities.

| Method | Task Type | Emotion Categories | Modalities |
|---|---|---|---|
| SVM [4] | Single | 5-class | EEG, EDA, GSR, SCR, skin temp |
| MT-MKL [25] | Single | 2-class | EEG, GSR, RB, skin temp |
| IRS [3] | Single | 4-class | ECG, GSR, PPG |
| BDAE [5] | Single | 2-class | EEG, eye movement |
| Bimodal-LSTM [26] | Single | 2-class | EDA, PPG, EMG |
| MESAE [6] | Single | 5-class | EEG, EOG, EMG, GSR, temp, BP |
| DCCA [9] | Single | 2/4-class | EEG, eye, GSR, EMG, PPG |
| MM-ResLSTM [7] | Single | 2-class | EEG, peripheral signals |
| FGSVM [8] | Single | 5-class | EDA, PPG, EMG |
| DPAN [27] | Single | 2-class | EDA, PPG |
| Random Forest [28] | Single | 2-class | EMG, EOG |
| RDFKM [29] | Single | 2-class | EEG, EMG, GSR, RES |
| i-Isomap + DCNN [10] | Single | 4-class | EEG, peripheral, eye |
| RHRPNet [11] | Single | 2/4-class | EEG, peripheral signals |
| DEMA [12] | Single | 2/5-class | EEG, GSR, BP, RB |
| Our Method | Multi-task | 3 × 9-class | EEG, GSR, BVP, EMG-Zyg, EMG-Trap, Resp., Temp., EOG |
Note: Bold indicates the proposed method in this study.
Table 2. Multi-task classification results on the DEAP dataset (9-class setup) with estimated computational costs on two NVIDIA RTX 4090 GPUs. All models are evaluated under a unified multi-task framework with shared classification heads.

| Model | Val Acc | Val F1 | Val Prec | Aro Acc | Aro F1 | Aro Prec | Like Acc | Like F1 | Like Prec | Params (M) | FLOPs (G) | Train Time (min) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSTM [31] | 85.6 | 87.3 | 86.1 | 84.5 | 87.0 | 85.6 | 83.2 | 86.1 | 84.4 | 1.62 | 0.24 | 56 |
| BiLSTM [32] | 87.0 | 83.6 | 85.0 | 83.1 | 84.2 | 85.3 | 84.1 | 83.8 | 82.5 | 2.45 | 0.40 | 74 |
| GRU [33] | 86.4 | 87.1 | 86.7 | 85.0 | 85.3 | 84.9 | 86.0 | 84.0 | 83.7 | 1.38 | 0.20 | 52 |
| MLP [34] | 85.8 | 86.2 | 85.5 | 84.3 | 85.1 | 84.4 | 85.1 | 83.5 | 83.2 | 0.62 | 0.06 | 38 |
| 1D-CNN [35] | 84.0 | 84.6 | 83.9 | 83.7 | 83.5 | 82.8 | 85.0 | 84.0 | 83.1 | 1.08 | 0.16 | 46 |
| XGBoost [36] | 83.2 | 84.1 | 83.0 | 87.2 | 89.0 | 88.1 | 83.0 | 86.4 | 85.2 | 0.28 | 0.02 | 26 |
| CNN-LSTM [37] | 83.7 | 84.5 | 84.2 | 83.5 | 87.1 | 85.5 | 84.3 | 86.0 | 85.1 | 3.12 | 0.58 | 96 |
| Transformer [38] | 88.4 | 85.0 | 85.7 | 90.1 | 89.5 | 89.6 | 84.2 | 87.3 | 86.4 | 3.78 | 1.05 | 112 |
| EmT [18] | 91.1 | 89.8 | 89.4 | 90.3 | 91.6 | 90.8 | 88.5 | 89.9 | 89.1 | 3.92 | 3.18 | 125 |
| MASA-TCN [19] | 89.6 | 88.4 | 88.1 | 88.9 | 90.3 | 89.5 | 86.7 | 88.2 | 87.4 | 2.41 | 2.02 | 85 |
| AT-DGNN [20] | 90.3 | 89.1 | 88.7 | 89.7 | 91.0 | 90.2 | 87.9 | 89.3 | 88.5 | 3.18 | 2.45 | 98 |
| Ours (Multi-Task) | 92.8 | 92.6 | 92.2 | 91.8 | 93.7 | 93.0 | 93.6 | 92.4 | 91.9 | 2.86 | 0.82 | 88 |
Note: Bold indicates the proposed method in this study.
Table 3. Fine-grained ablation study of key design choices (9-class multi-task). All variants use the same modalities (EEG + Peripheral + EOG) and the same training protocol unless otherwise specified.

| Variant | Val Acc (%) | Aro Acc (%) | Like Acc (%) | Avg |
|---|---|---|---|---|
| Ours (Full) | 92.8 | 91.8 | 93.6 | 92.7 |
| (0) Single-task vs. Multi-task | | | | |
| Single-task (Val only) | 92.9 | – | – | N/A |
| Single-task (Aro only) | – | 92.0 | – | N/A |
| Single-task (Like only) | – | – | 92.5 | N/A |
| Multi-task (shared backbone) | 92.8 | 91.8 | 93.6 | 92.7 |
| (1) CSI vs. ISI contributions | | | | |
| w/o CSI (keep ISI) | 91.2 | 90.4 | 92.0 | 91.2 |
| w/o ISI (keep CSI) | 90.6 | 89.3 | 91.1 | 90.3 |
| w/o CSI & ISI (token only) | 88.9 | 87.8 | 89.6 | 88.8 |
| (2) Effect of learnable tokenization matrix Q_i | | | | |
| Replace Q_i with AvgPool (fixed pooling) | 91.5 | 90.1 | 92.1 | 91.2 |
| Random Q_i (frozen, not learned) | 91.8 | 90.6 | 92.5 | 91.6 |
| (3) Token number sensitivity (EEG/Periph/EOG tokens) | | | | |
| (2/1/1) tokens | 91.3 | 90.2 | 92.2 | 91.2 |
| (4/2/1) tokens (default) | 92.8 | 91.8 | 93.6 | 92.7 |
| (8/4/2) tokens | 92.6 | 91.6 | 93.4 | 92.5 |
| (4) Normalization choice in DTFE | | | | |
| BatchNorm (BN) | 91.0 | 89.9 | 91.8 | 90.9 |
| LayerNorm (LN) | 92.1 | 91.0 | 92.9 | 92.0 |
| InstanceNorm (IN, default) | 92.8 | 91.8 | 93.6 | 92.7 |
| (5) ISI frequency gating: learnable Q_f vs. fixed band-pass | | | | |
| ISI w/o frequency gating (FFT + IFFT only) | 91.0 | 89.6 | 91.9 | 90.8 |
| Fixed band-pass filtering | 91.8 | 90.7 | 92.6 | 91.7 |
| Learnable frequency gating Q_f (default) | 92.8 | 91.8 | 93.6 | 92.7 |
Note: Bold indicates the proposed method in this study, while italics denote different task settings.
Table 4. Comparison with prior works evaluated on DEAP (accuracy %). Results of prior methods are directly reported from the respective papers. Our method reports mean ± standard deviation over 10 independent subject-wise splits.

| Method | Valence (%) | Arousal (%) | Q1–Q4 + Neutral (%) |
|---|---|---|---|
| SVM [4] | – | – | 81.45 |
| MT-MKL [25] | 60.00 | 58.00 | – |
| BDAE [5] | 85.20 | 80.50 | – |
| Bimodal-LSTM [26] | 83.82 | 83.23 | – |
| MESAE [6] | 83.04 | 84.18 | 84.18 |
| DCCA [9] | 85.62 | 84.33 | – |
| MM-ResLSTM [7] | 92.30 | 92.87 | – |
| FGSVM [8] | – | – | 89.53 |
| DPAN [27] | 78.72 | 79.03 | – |
| Random Forest [28] | 62.58 | – | – |
| RDFKM [29] | 64.50 | 63.10 | – |
| i-Isomap + DCNN [10] | – | – | 90.05 |
| RHRPNet [11] | 74.17 | 74.34 | – |
| DEMA [12] | 97.55 | 97.61 | 97.01 |
| Ours (Single-task) | 98.21 ± 0.29 | 98.34 ± 0.31 | 97.85 ± 0.26 |
Note: Bold indicates the proposed method in this study.
Table 5. Empirical class distribution of 9-point affective ratings in DEAP.

| Score | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Ratio (%) | 6 | 8 | 12 | 15 | 18 | 15 | 12 | 8 | 6 | 100 |
Table 6. Confusion matrix (%) of the proposed multi-task model on Valence (9-class).

| True \ Pred | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
|---|---|---|---|---|---|---|---|---|---|
| C1 | 95.4 | 1.2 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.1 |
| C2 | 1.5 | 94.1 | 1.3 | 0.8 | 0.7 | 0.6 | 0.5 | 0.3 | 0.2 |
| C3 | 0.9 | 1.4 | 95.0 | 0.9 | 0.6 | 0.5 | 0.4 | 0.2 | 0.1 |
| C4 | 0.8 | 1.0 | 1.1 | 89.3 | 3.2 | 2.1 | 1.2 | 0.8 | 0.5 |
| C5 | 0.7 | 0.8 | 0.9 | 3.5 | 88.6 | 2.9 | 1.3 | 0.8 | 0.5 |
| C6 | 0.6 | 0.7 | 0.8 | 2.8 | 3.1 | 89.8 | 1.4 | 0.6 | 0.2 |
| C7 | 0.5 | 0.6 | 0.4 | 1.2 | 1.4 | 1.3 | 94.2 | 0.3 | 0.1 |
| C8 | 0.3 | 0.4 | 0.3 | 0.7 | 0.9 | 0.8 | 0.5 | 95.8 | 0.3 |
| C9 | 0.2 | 0.3 | 0.2 | 0.6 | 0.7 | 0.6 | 0.3 | 0.4 | 96.7 |
Table 7. Confusion matrix (%) of the proposed multi-task model on Arousal (9-class).

| True \ Pred | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
|---|---|---|---|---|---|---|---|---|---|
| C1 | 93.8 | 1.4 | 1.0 | 0.9 | 0.8 | 0.7 | 0.5 | 0.5 | 0.4 |
| C2 | 1.6 | 92.7 | 1.5 | 1.0 | 0.9 | 0.8 | 0.6 | 0.5 | 0.4 |
| C3 | 1.0 | 1.5 | 93.4 | 1.1 | 0.9 | 0.8 | 0.6 | 0.4 | 0.3 |
| C4 | 0.9 | 1.2 | 1.3 | 87.5 | 3.9 | 2.8 | 1.5 | 0.6 | 0.3 |
| C5 | 0.8 | 1.0 | 1.1 | 4.2 | 86.9 | 3.4 | 1.6 | 0.7 | 0.3 |
| C6 | 0.7 | 0.9 | 1.0 | 3.6 | 3.8 | 87.8 | 1.6 | 0.4 | 0.2 |
| C7 | 0.6 | 0.7 | 0.6 | 1.5 | 1.6 | 1.7 | 93.1 | 0.1 | 0.1 |
| C8 | 0.5 | 0.6 | 0.5 | 0.8 | 0.9 | 0.8 | 0.3 | 95.2 | 0.4 |
| C9 | 0.4 | 0.4 | 0.4 | 0.6 | 0.7 | 0.6 | 0.2 | 0.3 | 96.4 |
Table 8. Confusion matrix (%) of the proposed multi-task model on Liking (9-class).

| True \ Pred | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
|---|---|---|---|---|---|---|---|---|---|
| C1 | 96.2 | 0.9 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.3 | 0.1 |
| C2 | 1.0 | 95.1 | 1.0 | 0.8 | 0.7 | 0.6 | 0.4 | 0.3 | 0.1 |
| C3 | 0.8 | 1.1 | 95.8 | 0.9 | 0.6 | 0.4 | 0.2 | 0.1 | 0.1 |
| C4 | 0.7 | 0.9 | 1.0 | 90.8 | 3.2 | 1.8 | 0.9 | 0.5 | 0.2 |
| C5 | 0.6 | 0.8 | 0.9 | 3.4 | 90.2 | 2.1 | 1.0 | 0.6 | 0.4 |
| C6 | 0.5 | 0.7 | 0.6 | 2.0 | 2.3 | 91.3 | 1.2 | 0.9 | 0.5 |
| C7 | 0.4 | 0.5 | 0.3 | 1.0 | 1.1 | 1.2 | 95.0 | 0.3 | 0.2 |
| C8 | 0.3 | 0.4 | 0.2 | 0.6 | 0.7 | 0.8 | 0.4 | 96.3 | 0.3 |
| C9 | 0.2 | 0.3 | 0.1 | 0.4 | 0.6 | 0.5 | 0.2 | 0.3 | 97.4 |
Table 9. Multi-task classification results on the DEAP dataset (9-class setup). Results are reported as mean ± standard deviation over 10 independent subject-wise splits.

| Model | Val Acc | Val F1 | Val Prec | Aro Acc | Aro F1 | Aro Prec | Like Acc | Like F1 | Like Prec |
|---|---|---|---|---|---|---|---|---|---|
| LSTM [31] | 85.18 ± 0.38 | 86.92 ± 0.36 | 85.77 ± 0.35 | 84.07 ± 0.40 | 86.55 ± 0.37 | 85.14 ± 0.36 | 82.76 ± 0.39 | 85.72 ± 0.36 | 83.91 ± 0.35 |
| BiLSTM [32] | 86.52 ± 0.36 | 83.21 ± 0.39 | 84.63 ± 0.37 | 82.74 ± 0.40 | 83.81 ± 0.39 | 84.90 ± 0.36 | 83.58 ± 0.38 | 83.41 ± 0.38 | 82.06 ± 0.37 |
| GRU [33] | 86.03 ± 0.34 | 86.71 ± 0.35 | 86.28 ± 0.33 | 84.63 ± 0.37 | 84.94 ± 0.36 | 84.52 ± 0.35 | 85.58 ± 0.36 | 83.66 ± 0.37 | 83.29 ± 0.36 |
| MLP [34] | 85.32 ± 0.39 | 85.79 ± 0.37 | 85.12 ± 0.36 | 83.88 ± 0.40 | 84.74 ± 0.38 | 83.97 ± 0.37 | 84.69 ± 0.39 | 83.12 ± 0.38 | 82.81 ± 0.37 |
| 1D-CNN [35] | 83.61 ± 0.40 | 84.21 ± 0.38 | 83.54 ± 0.36 | 83.28 ± 0.40 | 83.02 ± 0.39 | 82.33 ± 0.38 | 84.57 ± 0.39 | 83.63 ± 0.38 | 82.79 ± 0.37 |
| XGBoost [36] | 82.76 ± 0.40 | 83.65 ± 0.39 | 82.58 ± 0.38 | 86.81 ± 0.38 | 88.54 ± 0.36 | 87.76 ± 0.35 | 82.64 ± 0.40 | 85.97 ± 0.37 | 84.82 ± 0.36 |
| CNN-LSTM [37] | 83.21 ± 0.39 | 84.07 ± 0.37 | 83.81 ± 0.36 | 83.08 ± 0.40 | 86.66 ± 0.38 | 84.97 ± 0.36 | 83.94 ± 0.39 | 85.55 ± 0.37 | 84.67 ± 0.36 |
| Transformer [38] | 87.92 ± 0.32 | 84.61 ± 0.35 | 85.21 ± 0.34 | 89.63 ± 0.30 | 88.94 ± 0.31 | 89.02 ± 0.30 | 83.81 ± 0.35 | 86.91 ± 0.33 | 85.97 ± 0.32 |
| EmT [18] | 90.98 ± 0.31 | 89.62 ± 0.33 | 89.25 ± 0.32 | 90.16 ± 0.34 | 91.44 ± 0.31 | 90.56 ± 0.32 | 88.31 ± 0.36 | 89.74 ± 0.34 | 88.95 ± 0.33 |
| MASA-TCN [19] | 89.48 ± 0.35 | 88.27 ± 0.36 | 87.95 ± 0.35 | 88.73 ± 0.37 | 90.11 ± 0.35 | 89.32 ± 0.34 | 86.53 ± 0.38 | 88.04 ± 0.36 | 87.25 ± 0.35 |
| AT-DGNN [20] | 90.17 ± 0.33 | 88.94 ± 0.34 | 88.54 ± 0.33 | 89.56 ± 0.35 | 90.78 ± 0.32 | 90.01 ± 0.33 | 87.74 ± 0.36 | 89.16 ± 0.34 | 88.37 ± 0.33 |
| Ours (Multi-Task) | 92.47 ± 0.29 | 92.21 ± 0.31 | 91.88 ± 0.29 | 91.42 ± 0.33 | 93.18 ± 0.30 | 92.55 ± 0.30 | 93.08 ± 0.26 | 92.01 ± 0.30 | 91.63 ± 0.31 |
Note: Bold indicates the proposed method in this study.
Table 10. LOSO cross-validation results of the proposed method under the 9-class multi-task setting. Results are reported as mean ± STD across 32 folds, with 95% CI.

| Metric | Valence | Arousal | Liking |
|---|---|---|---|
| Accuracy (%) | 91.7 ± 2.1 (CI ± 0.73) | 90.9 ± 2.3 (CI ± 0.80) | 92.5 ± 1.9 (CI ± 0.66) |
| F1-score (%) | 91.2 ± 2.0 (CI ± 0.69) | 92.0 ± 2.1 (CI ± 0.73) | 91.8 ± 1.8 (CI ± 0.62) |
| Precision (%) | 90.8 ± 2.2 (CI ± 0.76) | 92.6 ± 2.0 (CI ± 0.69) | 91.4 ± 1.9 (CI ± 0.66) |
Table 11. Multi-task classification results on the DEAP dataset (9-class setup) under a subject-wise 7:3 train–test split.

| Model | Val Acc | Val F1 | Val Prec | Aro Acc | Aro F1 | Aro Prec | Like Acc | Like F1 | Like Prec |
|---|---|---|---|---|---|---|---|---|---|
| LSTM [31] | 84.3 | 85.8 | 84.6 | 83.2 | 85.4 | 84.1 | 82.1 | 84.7 | 83.0 |
| BiLSTM [32] | 85.6 | 82.4 | 83.7 | 82.0 | 83.1 | 84.0 | 83.0 | 82.6 | 81.5 |
| GRU [33] | 85.1 | 85.6 | 85.2 | 83.8 | 84.1 | 83.6 | 84.6 | 82.9 | 82.6 |
| MLP [34] | 84.7 | 84.9 | 84.2 | 83.1 | 83.8 | 83.2 | 84.0 | 82.4 | 82.0 |
| 1D-CNN [35] | 82.9 | 83.5 | 82.7 | 82.6 | 82.4 | 81.9 | 83.8 | 82.7 | 81.9 |
| XGBoost [36] | 81.9 | 82.7 | 81.6 | 86.1 | 87.6 | 86.7 | 81.8 | 85.1 | 84.0 |
| CNN-LSTM [37] | 82.6 | 83.4 | 83.0 | 82.4 | 85.8 | 84.1 | 83.1 | 84.7 | 83.8 |
| Transformer [38] | 87.2 | 83.6 | 84.4 | 88.9 | 88.1 | 88.2 | 82.9 | 85.8 | 84.9 |
| EmT [18] | 90.6 | 89.3 | 88.8 | 90.0 | 91.2 | 90.4 | 88.9 | 90.0 | 89.3 |
| MASA-TCN [19] | 89.2 | 88.0 | 87.6 | 88.5 | 89.8 | 89.1 | 87.4 | 88.7 | 88.0 |
| AT-DGNN [20] | 90.0 | 88.8 | 88.4 | 89.4 | 90.6 | 89.9 | 88.2 | 89.5 | 88.8 |
| Ours (Multi-Task) | 91.7 | 91.4 | 91.0 | 90.6 | 92.5 | 91.8 | 92.4 | 91.2 | 90.7 |
Note: Bold indicates the proposed method in this study.
Table 12. Multi-task classification results on the DEAP dataset (9-class setup) under a subject-wise 6:4 train–test split.

| Model | Val Acc | Val F1 | Val Prec | Aro Acc | Aro F1 | Aro Prec | Like Acc | Like F1 | Like Prec |
|---|---|---|---|---|---|---|---|---|---|
| LSTM [31] | 82.6 | 84.1 | 82.9 | 81.4 | 83.2 | 82.0 | 80.3 | 82.8 | 81.2 |
| BiLSTM [32] | 83.8 | 80.9 | 82.1 | 80.6 | 81.7 | 82.5 | 81.5 | 81.1 | 80.1 |
| GRU [33] | 83.5 | 84.0 | 83.6 | 82.1 | 82.4 | 81.9 | 82.7 | 81.2 | 80.9 |
| MLP [34] | 83.0 | 83.2 | 82.5 | 81.6 | 82.3 | 81.7 | 82.0 | 80.6 | 80.3 |
| 1D-CNN [35] | 81.2 | 81.8 | 81.1 | 80.9 | 80.7 | 80.2 | 82.1 | 81.0 | 80.2 |
| XGBoost [36] | 80.1 | 80.9 | 79.8 | 84.7 | 86.1 | 85.3 | 80.4 | 83.7 | 82.5 |
| CNN-LSTM [37] | 81.5 | 82.3 | 81.9 | 81.1 | 84.2 | 82.8 | 81.7 | 83.3 | 82.3 |
| Transformer [38] | 85.6 | 82.1 | 82.8 | 87.4 | 86.8 | 86.9 | 81.4 | 84.0 | 83.1 |
| EmT [18] | 89.1 | 87.9 | 87.4 | 88.6 | 89.8 | 89.0 | 87.1 | 88.3 | 87.6 |
| MASA-TCN [19] | 87.8 | 86.6 | 86.1 | 87.0 | 88.3 | 87.6 | 85.6 | 86.9 | 86.2 |
| AT-DGNN [20] | 88.6 | 87.4 | 86.9 | 88.0 | 89.2 | 88.4 | 86.4 | 87.7 | 87.0 |
| Ours (Multi-Task) | 90.2 | 89.9 | 89.4 | 88.9 | 90.8 | 90.1 | 90.9 | 89.7 | 89.2 |
Note: Bold indicates the proposed method in this study.
Table 13. Detailed 9-class precision (%) for the multi-task setting.

| Class | Valence | Arousal | Liking |
|---|---|---|---|
| 1 | 88.4 | 88.1 | 87.6 |
| 2 | 89.3 | 88.8 | 88.4 |
| 3 | 90.5 | 89.7 | 89.1 |
| 4 | 91.1 | 90.4 | 89.8 |
| 5 | 92.0 | 91.3 | 90.6 |
| 6 | 90.8 | 90.6 | 90.2 |
| 7 | 89.7 | 89.5 | 89.4 |
| 8 | 90.4 | 89.9 | 89.9 |
| 9 | 89.6 | 89.0 | 89.6 |
| Macro Avg | 90.2 | 89.9 | 89.4 |
Table 14. Ablation study of DTFE and Cross-Modal Fusion on the DEAP dataset.

| Model Variant | Val Acc (%) | Val F1 (%) | Val Prec (%) | Aro Acc (%) | Aro F1 (%) | Aro Prec (%) | Like Acc (%) | Like F1 (%) | Like Prec (%) |
|---|---|---|---|---|---|---|---|---|---|
| w/o DTFE | 88.5 | 87.3 | 86.8 | 87.1 | 85.9 | 85.5 | 88.0 | 86.2 | 85.4 |
| w/o Cross-Modal Fusion | 89.4 | 88.2 | 87.9 | 88.3 | 87.0 | 86.4 | 89.1 | 87.1 | 86.8 |
| Ours (Full) | 92.8 | 92.6 | 92.2 | 91.8 | 93.7 | 93.0 | 93.6 | 92.4 | 91.9 |
Note: Bold indicates the proposed method in this study.
Table 15. Comparison between fixed and learnable loss weights on the DEAP dataset (9-class classification).

| Loss Weight Type | Val Acc | Val F1 | Val Prec | Aro Acc | Aro F1 | Aro Prec | Like Acc | Like F1 | Like Prec |
|---|---|---|---|---|---|---|---|---|---|
| Fixed Weights | 90.3 | 89.7 | 88.9 | 89.1 | 89.4 | 88.7 | 89.5 | 88.8 | 88.0 |
| Learnable Weights (Ours) | 92.8 | 92.6 | 92.2 | 91.8 | 93.7 | 93.0 | 93.6 | 92.4 | 91.9 |
Note: Bold indicates the proposed method in this study.
Table 16. Disentangling the gains from modality scaling and the proposed modules on DEAP (9-class multi-task). MLP-MT denotes a simple shared-backbone MLP with three task-specific heads under the same training protocol.

| Model | Modalities | Val Acc (%) | Aro Acc (%) | Like Acc (%) | Avg | Params (M) | FLOPs (G) | Time (min) |
|---|---|---|---|---|---|---|---|---|
| MLP-MT | EEG | 83.6 | 82.9 | 84.2 | 83.6 | 0.48 | 0.05 | 24 |
| MLP-MT | EEG + Peripheral | 85.9 | 85.3 | 86.7 | 86.0 | 0.56 | 0.07 | 28 |
| MLP-MT | EEG + Peripheral + EOG | 86.8 | 86.1 | 87.4 | 86.8 | 0.59 | 0.08 | 30 |
| Ours (Full) | EEG + Peripheral + EOG | 92.8 | 91.8 | 93.6 | 92.7 | 2.86 | 0.82 | 88 |
Interpretation: Modality gain is measured within MLP-MT when adding modalities (EEG → +Peripheral → +EOG). Module gain is measured by comparing Ours (Full) with MLP-MT under the same full-modality input. Bold indicates the proposed method in this study.
Table 17. Effect of window overlap on the proposed method (9-class multi-task).

| Protocol | Val Acc (%) | Aro Acc (%) | Like Acc (%) |
|---|---|---|---|
| Overlapping (stride = 1 s) | 92.47 | 91.42 | 93.08 |
| Non-overlapping (stride = 10 s) | 91.27 | 90.12 | 91.98 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
